Reads were mapped to the Christensen 2018 chinook genome and SNPs were called using freebayes. Filtered data set was haplotyped using rad_haplotyper. Fst-outlier were identified using FDIST method implemented in Arlequin and bayescan.
##
## FULL DATA SET: all SNP-containing loci/all individuals grouped by run type.
## /// GENIND OBJECT /////////
##
## // 386 individuals; 12,983 loci; 30,037 alleles; size: 52.2 Mb
##
## // Basic content
## @tab: 386 x 30037 matrix of allele counts
## @loc.n.all: number of alleles per locus (range: 2-9)
## @loc.fac: locus factor for the 30037 columns of @tab
## @all.names: list of allele names for each locus
## @ploidy: ploidy of each individual (range: 2-2)
## @type: codom
## @call: .local(x = x, i = i, j = j, drop = drop)
##
## // Optional content
## @pop: population of each individual (group size range: 21-270)
## @strata: a data frame with 10 columns ( LIB, LIB_ID, SAMPLE_ID, YEAR, SOURCE, RUN, ... )
Number of samples per tributary and run type.
| LOCATION | Fall | Late-Fall | Winter | Spring |
|---|---|---|---|---|
| COL | 30 | - | - | - |
| MIL | 20 | - | - | 16 |
| DER | 15 | - | - | 27 |
| BUT | 21 | - | - | 19 |
| FRH | 27 | - | - | 7 |
| NIM | 30 | - | - | - |
| MKH | 28 | - | - | - |
| STN | 23 | - | - | - |
| TOU | 30 | - | - | - |
| MRH | 15 | - | - | - |
| MER | 31 | - | - | - |
| USR | - | 21 | 26 | - |
For details on bioinformatics on genotyping the data set see Supplementary Material I
Population structure was explored using two methods, a clustering analysis based on genetic similarity and an assessment of population differentiation among individuals grouped a priori based on run type and tributary.
Reduce the data’s dimensionality using a principle component analysis and project genotypes into two dimensional space visualize sample’s distances.
# create matrix with centered/scaled allele frequencies
# will also replace missing data
X <- scaleGen(gen, NA.method = "mean")
# perform PCA/retain top 3 PCs
PCA <- dudi.pca(X, center = FALSE, scale = FALSE,
scannf = FALSE, nf = 10)
# extract Eigenvalues of each PC & calculate % variance explained by each PC
eig <- eigenvalues(PCA)
# # plot %variance explained by of top 25 PCs
# plot.eigen.variance(eig)
# Individuals' contribution to PCs/calculate Loading per individual and PC
PC_ind <- PC.ind(PCA)
# join individuals' contribution to PC with strata
PC_ind <- left_join(PC_ind, strata)
PC1 (2.5% of variation) separates Winter run individuals as being clearly distinct. Spring individuals cluster directly adjacent to Fall. PC2 (1.1% of variation) has a tight cluster of Late-Fall and Fall run individuals with Spring run clustering in a more spread out fashion. Spring Feather River Hatchery individuals cluster on the outskirts of the Fall cluster with Fall Feather River wild individuals.
Principal Component Analysis. Color of each data point indicates tributary of that individual. Shape of data point indicates run type.
Individuals were clustered into K = 2 – 10 groups using k-means
clustering based on the PCA-transformed genotype matrix (i.e. no
assumptions regarding Hardy-Weinberg or linkage disequilibrium) followed
by a discriminant analysis of principle components (DAPC) to determine
the membership probabilities of each sample to each inferred cluster as
implemented in adegenet (Jombart, Devillard, and Balloux
2010). During DAPC, the genotype matrix is first
PCA-transformed, then samples are partitioned into a within- and
between-group component to maximize discrimination between groups using
linear discriminant analysis, and finally, the membership probability of
each sample to each cluster is calculated. To ensure sufficient variance
was retained to discriminate among groups but not overfit the data, the
optimum number of principle components to retain was determined using a
stratified cross-validation of DAPC.
Use k-means clustering to identify clusters based on genetic similarity (data transformed using PCA). For clustering it is appropriate to retain all PCs (all variance in the data set), as overfitting is not an issue.
Figure S1: AIC for data clustered into K = 1 - 40 groups using k-means clustering.
AIC is minimized for K=4.
Perform stratified cross-validation of DAPC using a range of retained PCs, while keeping the number of discriminant functions fixed for K = 2 - 10 to determine the most appropriate number of principle components to retain sufficient variance to discriminate but not overfit the data.
## define minimum and maximum K to cluster for ----
min_K <- 2
max_K <- 10
# number of repetitions for x-validation
n_rep <- 30
# range of values to loop over
range <- min_K:max_K
## loop over values for K ----
# list for cross validation plots
xVal_plots <- list()
# data frame for optimum PCs
opt_PC <- data.frame(matrix(ncol = 4, nrow = 0)) %>%
rename(k_clust = X1,
optPC = X2,
variance = X3,
mean_opt_success = X4)
# list for membership plots
memb_plots <- list()
# run loop
for (k in range){
# set value for K
k_clust <- k
# cluster
grp <- find.clusters.genind(gen, n.pca = 500, n.clust = k_clust, method = "kmeans")
# scale allele frq/missing data replaced with mean
X <- scaleGen(gen, NA.method = "mean")
xval <- xvalDapc(X, grp$grp,
n.pca.max = 400, training.set = 0.9,
result = "groupMean", center = TRUE, scale = FALSE,
n.pca = NULL, n.rep = n_rep, xval.plot = FALSE)
# extract assignment success for each xVal step
PCs <- data.frame(N_PCs = xval[["Cross-Validation Results"]]$n.pca,
PROP_SUCCESS = xval[["Cross-Validation Results"]]$success)
# create crossvalidation plot
xVal_plots[[k_clust]] <- ggplot(PCs, aes(x = N_PCs, y = PROP_SUCCESS)) +
stat_density2d(aes(fill = ..density..^0.25),
geom = "tile", contour = FALSE, n = 200) +
geom_point() +
scale_fill_continuous(low = "white", high = "dodgerblue4") +
scale_y_continuous(limits = c(0, 1)) +
labs(title = glue("Cross validation K = {k_clust}"),
x = "retained PCs", y = "assignment success") +
theme_standard +
theme(legend.position = "none")
# optimum number of PCs to retain
retain <- as.numeric(xval$`Number of PCs Achieving Highest Mean Success`)
variance <- round(xval[["DAPC"]]$var*100, digits = 2)
opt <- PCs %>%
filter(N_PCs == retain)
df <- as.data.frame(x = k_clust) %>%
mutate(optPC = retain,
variance = variance,
mean_opt_success = round(mean(opt$PROP_SUCCESS), digits = 2))
opt_PC <- opt_PC %>%
bind_rows(df)
# extract membership assignment probabilities
grp_membership <- as.data.frame(xval[["DAPC"]]$posterior) %>%
rownames_to_column("LIB_ID") %>%
gather(key = GRP, value = MEMBSHIP, 2:(k_clust+1)) %>%
left_join(strata) %>%
unite(RUN_LOC, RUN, LOCATION, sep = "_", remove = FALSE) %>%
arrange(RUN_LOC, LIB_ID) %>%
mutate(LIB_ID = factor(LIB_ID, levels = unique(LIB_ID)))
path <- as.character(glue("results/runs_K{k}.grp.membership"))
write_delim(grp_membership, path, delim = "\t")
# create membership plot
memb_plots[[k_clust]] <- ggplot(grp_membership,
aes(x = LIB_ID, y = MEMBSHIP, fill = GRP, color = GRP)) +
geom_bar(stat = "identity") +
facet_grid(. ~ RUN_LOC, scales = "free", space = "free") +
scale_fill_manual(values = col_nine) +
scale_color_manual(values = col_nine) +
labs(x = "INDV", y = "memb. prob") +
theme_standard +
theme(axis.text.x = element_blank(),
strip.text.x = element_text(size = 6, color = "black"),
strip.text.y = element_text(size = 6, color = "black"),
legend.position = "none")
}
write_delim(opt_PC, "results/kmeans_all-loci.optPCs", delim = "\t")
Retaining too few PCs may result in important variance not being retained and therefore not informing the clustering analysis, while retaining too many PCs results in overfitting, i.e. assignment success decreases.
Proportion of test individuals correctly assigned to their cluster of origin based on number of retained PCs for K = 2 - 9.
Table S1: The optimum number of principle components,%-variance retained and the mean optimum assignment succes for K = 2 -10 clusters.
| k_clust | optPC | variance | mean_opt_success |
|---|---|---|---|
| 2 | 50 | 63.49 | 1.00 |
| 3 | 50 | 21.77 | 1.00 |
| 4 | 100 | 36.98 | 0.99 |
| 5 | 100 | 36.98 | 0.97 |
| 7 | 50 | 21.77 | 0.97 |
| 8 | 20 | 11.62 | 0.96 |
| 6 | 50 | 21.77 | 0.95 |
| 9 | 40 | 18.50 | 0.87 |
| 10 | 80 | 31.08 | 0.82 |
Genetic data is scaled and centered, then transformed using a PCA. Retained PCs are then transmitted to a linear discriminant analysis.
Calculate membership probability of each individual to per cluster.
Membership probability of each individual to clusters identified using k-means hierarchical clustering (equiavalent to Figure 2 in manuscript).
Tributaries are grouped within runs by tributary from north to south as Fall_BUT, Fall_COL, Fall_DER, Fall_FRH, Fall_MER, Fall_MIL, Fall_MKH, Fall_MRH, Fall_NIM, Fall_STN, Fall_TOU, Late-Fall_USR, Spring_BUT, Spring_DER, Spring_FRH, Spring_MIL, and Winter_USR (individual panels).
Weir & Cockerham’s unbiased estimator of FST (Weir and Cockerham 1984) was used to calculate population differentiation among individuals grouped a priori by run type within tributary.
To test for genetic heterogeneity in the data set, global FST was calculated across all groups.
# group genetic data for comparison
setPop(gen) <- ~RUN_LOC
pop <- popNames(gen)
# calculate Fst
tidy <- tidy_genomic_data(data = gen, filename = NULL, parallel = 65)
fst <- assigner::fst_WC84(data = tidy,
pop.levels = pop,
holdout.samples = NULL,
pairwise = TRUE,
ci = TRUE,
iteration.ci = 1000,
quantiles.ci = c(0.025, 0.975),
digits = 9,
verbose = TRUE)
# write results
write_delim(fst$fst.overall, "results/trib_allLoc.globalfst.ci", delim = "\t")
write_delim(fst[["fst.ranked"]], "results/trib_ranked.perlocus.fst", delim = "\t")
write_delim(fst[["fis.markers"]], "results/trib_perlocus.fis", delim = "\t")
write_delim(fst$pairwise.fst, "results/trib_allLoc.fst.ci", delim = "\t")
write_delim(fst$sigma.loc, "results/trib_perlocus.sigma", delim = "\t")
write_delim(fst$fis.markers, "trib_perlocus.fis", delim = "\t")
write_delim(fst$fis.overall, "results/trib_allLoc.fis", delim = "\t")
Significant heterogeneity was detected among
individuals grouped by run/tributary. Significance was determined using
95% confidence intervals around each estimate generated by re-sampling
loci 1,000 times using assigner (Gosselin, Anderson, and Bradbury
2016).
Global Fst and 95% CI for individuals grouped by runtype.
| FST | N_MARKERS | CI_LOW | CI_HIGH |
|---|---|---|---|
| 0.0318754 | 12886 | 0.0309843 | 0.032883 |
Pairwise FST was calculated as a post hoc test for pairwise differences among groups. Significance was determined using 95% confidence intervals around each estimate generated by re-sampling loci 1,000 times.
Table S2: Pairwise Fst among runs basins.
| POP1 | POP2 | COMP | N_MARKERS | FST | CI_LOW | CI_HIGH |
|---|---|---|---|---|---|---|
| F_MRH | F_DER | F-F | 9448 | 0.018 | 0.017 | 0.019 |
| F_MRH | F_MIL | F-F | 9594 | 0.016 | 0.015 | 0.017 |
| F_MRH | F_BUT | F-F | 9600 | 0.015 | 0.013 | 0.016 |
| F_DER | F_MIL | F-F | 9678 | 0.014 | 0.013 | 0.016 |
| F_MRH | F_STN | F-F | 9688 | 0.013 | 0.012 | 0.014 |
| F_DER | F_STN | F-F | 9784 | 0.012 | 0.011 | 0.013 |
| F_DER | F_BUT | F-F | 9667 | 0.012 | 0.010 | 0.012 |
| F_MIL | F_BUT | F-F | 9798 | 0.011 | 0.010 | 0.012 |
| F_MIL | F_STN | F-F | 9896 | 0.011 | 0.010 | 0.012 |
| F_BUT | F_STN | F-F | 9848 | 0.008 | 0.007 | 0.009 |
| F_FRH | F_MRH | F-F | 10026 | 0.008 | 0.007 | 0.009 |
| F_MRH | F_COL | F-F | 10129 | 0.008 | 0.007 | 0.008 |
| F_MRH | F_TOU | F-F | 9863 | 0.007 | 0.006 | 0.008 |
| F_FRH | F_DER | F-F | 10113 | 0.007 | 0.006 | 0.008 |
| F_DER | F_TOU | F-F | 9939 | 0.007 | 0.006 | 0.007 |
| F_MKH | F_MRH | F-F | 9852 | 0.006 | 0.006 | 0.007 |
| F_MIL | F_TOU | F-F | 10068 | 0.006 | 0.005 | 0.007 |
| F_FRH | F_BUT | F-F | 10186 | 0.006 | 0.005 | 0.007 |
| F_FRH | F_MIL | F-F | 10227 | 0.006 | 0.005 | 0.007 |
| F_COL | F_STN | F-F | 10328 | 0.006 | 0.005 | 0.007 |
| F_MKH | F_DER | F-F | 9934 | 0.006 | 0.005 | 0.007 |
| F_FRH | F_STN | F-F | 10250 | 0.006 | 0.005 | 0.007 |
| F_MKH | F_MIL | F-F | 10034 | 0.006 | 0.005 | 0.007 |
| F_MRH | F_NIM | F-F | 9948 | 0.006 | 0.005 | 0.006 |
| F_DER | F_NIM | F-F | 10047 | 0.006 | 0.005 | 0.006 |
| F_DER | F_COL | F-F | 10130 | 0.005 | 0.004 | 0.006 |
| F_BUT | F_COL | F-F | 10250 | 0.005 | 0.004 | 0.006 |
| F_MRH | F_MER | F-F | 9956 | 0.005 | 0.004 | 0.006 |
| F_MIL | F_NIM | F-F | 10146 | 0.005 | 0.004 | 0.005 |
| F_DER | F_MER | F-F | 10048 | 0.005 | 0.004 | 0.005 |
| F_MIL | F_COL | F-F | 10249 | 0.005 | 0.004 | 0.005 |
| F_MIL | F_MER | F-F | 10151 | 0.004 | 0.004 | 0.005 |
| F_BUT | F_TOU | F-F | 10024 | 0.004 | 0.003 | 0.005 |
| F_MKH | F_BUT | F-F | 10005 | 0.004 | 0.003 | 0.005 |
| F_TOU | F_COL | F-F | 10432 | 0.004 | 0.003 | 0.004 |
| F_BUT | F_NIM | F-F | 10106 | 0.003 | 0.003 | 0.004 |
| F_MKH | F_STN | F-F | 10061 | 0.003 | 0.002 | 0.004 |
| F_TOU | F_STN | F-F | 10075 | 0.003 | 0.002 | 0.004 |
| F_NIM | F_COL | F-F | 10479 | 0.003 | 0.002 | 0.004 |
| F_MKH | F_COL | F-F | 10426 | 0.003 | 0.002 | 0.003 |
| F_FRH | F_TOU | F-F | 10387 | 0.003 | 0.002 | 0.003 |
| F_BUT | F_MER | F-F | 10116 | 0.002 | 0.002 | 0.003 |
| F_NIM | F_STN | F-F | 10144 | 0.002 | 0.001 | 0.003 |
| F_MKH | F_FRH | F-F | 10358 | 0.002 | 0.001 | 0.003 |
| F_MER | F_STN | F-F | 10173 | 0.002 | 0.001 | 0.003 |
| F_MER | F_COL | F-F | 10495 | 0.002 | 0.001 | 0.002 |
| F_FRH | F_NIM | F-F | 10433 | 0.002 | 0.001 | 0.002 |
| F_FRH | F_COL | F-F | 10545 | 0.001 | 0.001 | 0.002 |
| F_FRH | F_MER | F-F | 10417 | 0.001 | 0.000 | 0.002 |
| F_MKH | F_TOU | F-F | 10186 | 0.000 | 0.000 | 0.001 |
| F_MKH | F_MER | F-F | 10252 | 0.000 | 0.000 | 0.000 |
| F_MKH | F_NIM | F-F | 10259 | 0.000 | 0.000 | 0.000 |
| F_TOU | F_MER | F-F | 10282 | 0.000 | 0.000 | 0.000 |
| F_TOU | F_NIM | F-F | 10306 | 0.000 | 0.000 | 0.000 |
| F_MER | F_NIM | F-F | 10336 | 0.000 | 0.000 | 0.000 |
| S_BUT | S_FRH | S-S | 8863 | 0.050 | 0.048 | 0.052 |
| S_MIL | S_FRH | S-S | 9009 | 0.047 | 0.045 | 0.049 |
| S_MIL | S_BUT | S-S | 9226 | 0.036 | 0.034 | 0.038 |
| S_DER | S_FRH | S-S | 9710 | 0.027 | 0.025 | 0.028 |
| S_DER | S_BUT | S-S | 9764 | 0.023 | 0.022 | 0.024 |
| S_DER | S_MIL | S-S | 9838 | 0.011 | 0.010 | 0.012 |
| F_MRH | L_USR | F-L | 9578 | 0.019 | 0.018 | 0.020 |
| F_DER | L_USR | F-L | 9665 | 0.017 | 0.016 | 0.019 |
| F_MIL | L_USR | F-L | 9831 | 0.016 | 0.015 | 0.017 |
| F_BUT | L_USR | F-L | 9749 | 0.015 | 0.014 | 0.016 |
| F_STN | L_USR | F-L | 9888 | 0.015 | 0.014 | 0.016 |
| F_FRH | L_USR | F-L | 10243 | 0.012 | 0.011 | 0.013 |
| F_MKH | L_USR | F-L | 10074 | 0.012 | 0.011 | 0.013 |
| F_TOU | L_USR | F-L | 10052 | 0.011 | 0.010 | 0.012 |
| F_COL | L_USR | F-L | 10252 | 0.010 | 0.009 | 0.012 |
| F_NIM | L_USR | F-L | 10151 | 0.010 | 0.009 | 0.011 |
| F_MER | L_USR | F-L | 10159 | 0.009 | 0.008 | 0.009 |
| F_STN | W_USR | F-W | 9545 | 0.162 | 0.157 | 0.167 |
| F_BUT | W_USR | F-W | 9370 | 0.156 | 0.151 | 0.162 |
| F_MKH | W_USR | F-W | 9744 | 0.156 | 0.151 | 0.161 |
| F_TOU | W_USR | F-W | 9800 | 0.156 | 0.151 | 0.161 |
| F_NIM | W_USR | F-W | 9920 | 0.156 | 0.151 | 0.161 |
| F_MER | W_USR | F-W | 9914 | 0.152 | 0.147 | 0.158 |
| F_MRH | W_USR | F-W | 9052 | 0.152 | 0.147 | 0.157 |
| F_MIL | W_USR | F-W | 9385 | 0.152 | 0.146 | 0.157 |
| F_DER | W_USR | F-W | 9191 | 0.149 | 0.144 | 0.154 |
| F_FRH | W_USR | F-W | 9946 | 0.146 | 0.141 | 0.151 |
| F_COL | W_USR | F-W | 9934 | 0.141 | 0.136 | 0.146 |
| F_MRH | S_BUT | F-S | 9420 | 0.051 | 0.049 | 0.053 |
| F_STN | S_BUT | F-S | 9874 | 0.049 | 0.047 | 0.052 |
| F_DER | S_BUT | F-S | 9561 | 0.049 | 0.047 | 0.051 |
| F_BUT | S_BUT | F-S | 9720 | 0.047 | 0.045 | 0.050 |
| F_MIL | S_BUT | F-S | 9721 | 0.047 | 0.045 | 0.049 |
| F_MKH | S_BUT | F-S | 10075 | 0.044 | 0.042 | 0.046 |
| F_TOU | S_BUT | F-S | 10045 | 0.044 | 0.042 | 0.046 |
| F_MRH | S_MIL | F-S | 9570 | 0.044 | 0.042 | 0.046 |
| F_NIM | S_BUT | F-S | 10190 | 0.043 | 0.041 | 0.045 |
| F_MER | S_BUT | F-S | 10185 | 0.040 | 0.039 | 0.042 |
| F_MRH | S_FRH | F-S | 8928 | 0.039 | 0.037 | 0.041 |
| F_DER | S_MIL | F-S | 9650 | 0.038 | 0.036 | 0.040 |
| F_FRH | S_BUT | F-S | 10190 | 0.037 | 0.035 | 0.039 |
| F_STN | S_MIL | F-S | 9913 | 0.036 | 0.034 | 0.038 |
| F_COL | S_BUT | F-S | 10263 | 0.036 | 0.034 | 0.038 |
| F_MIL | S_MIL | F-S | 9805 | 0.036 | 0.034 | 0.038 |
| F_DER | S_FRH | F-S | 9050 | 0.036 | 0.034 | 0.038 |
| F_BUT | S_MIL | F-S | 9799 | 0.036 | 0.034 | 0.038 |
| F_MKH | S_MIL | F-S | 10111 | 0.031 | 0.029 | 0.032 |
| F_TOU | S_MIL | F-S | 10112 | 0.031 | 0.029 | 0.032 |
| F_STN | S_DER | F-S | 10413 | 0.030 | 0.029 | 0.032 |
| F_MRH | S_DER | F-S | 10115 | 0.030 | 0.028 | 0.032 |
| F_MIL | S_FRH | F-S | 9266 | 0.030 | 0.028 | 0.031 |
| F_BUT | S_FRH | F-S | 9264 | 0.029 | 0.028 | 0.031 |
| F_BUT | S_DER | F-S | 10302 | 0.029 | 0.027 | 0.031 |
| F_NIM | S_MIL | F-S | 10235 | 0.029 | 0.027 | 0.030 |
| F_DER | S_DER | F-S | 10189 | 0.029 | 0.027 | 0.030 |
| F_MKH | S_DER | F-S | 10546 | 0.028 | 0.027 | 0.030 |
| F_NIM | S_DER | F-S | 10662 | 0.028 | 0.026 | 0.029 |
| F_TOU | S_DER | F-S | 10544 | 0.027 | 0.026 | 0.029 |
| F_MIL | S_DER | F-S | 10317 | 0.027 | 0.025 | 0.029 |
| F_MER | S_MIL | F-S | 10219 | 0.027 | 0.025 | 0.028 |
| F_STN | S_FRH | F-S | 9382 | 0.027 | 0.025 | 0.028 |
| F_MER | S_DER | F-S | 10620 | 0.025 | 0.024 | 0.027 |
| F_FRH | S_MIL | F-S | 10235 | 0.024 | 0.023 | 0.026 |
| F_COL | S_MIL | F-S | 10297 | 0.023 | 0.021 | 0.024 |
| F_FRH | S_DER | F-S | 10637 | 0.021 | 0.019 | 0.022 |
| F_COL | S_DER | F-S | 10702 | 0.021 | 0.019 | 0.023 |
| F_MKH | S_FRH | F-S | 9627 | 0.018 | 0.017 | 0.020 |
| F_TOU | S_FRH | F-S | 9654 | 0.018 | 0.017 | 0.019 |
| F_NIM | S_FRH | F-S | 9790 | 0.017 | 0.016 | 0.018 |
| F_COL | S_FRH | F-S | 9885 | 0.014 | 0.013 | 0.016 |
| F_MER | S_FRH | F-S | 9762 | 0.014 | 0.013 | 0.015 |
| F_FRH | S_FRH | F-S | 9754 | 0.014 | 0.013 | 0.015 |
| W_USR | L_USR | W-L | 9355 | 0.160 | 0.155 | 0.165 |
| S_BUT | L_USR | S-L | 9685 | 0.051 | 0.049 | 0.054 |
| S_MIL | L_USR | S-L | 9808 | 0.038 | 0.036 | 0.041 |
| S_DER | L_USR | S-L | 10307 | 0.032 | 0.030 | 0.035 |
| S_FRH | L_USR | S-L | 9252 | 0.031 | 0.029 | 0.032 |
| W_USR | S_BUT | W-S | 8812 | 0.159 | 0.154 | 0.164 |
| W_USR | S_MIL | W-S | 8985 | 0.143 | 0.139 | 0.148 |
| W_USR | S_DER | W-S | 9717 | 0.141 | 0.136 | 0.146 |
| W_USR | S_FRH | W-S | 8288 | 0.122 | 0.118 | 0.126 |
Figure S2: Pairwise Fst of individuals grouped by run/tributary. Fst-values denoted with * have lower CI > 0
For all measures of genetic diversity comparisons were made for individuals grouped by run type within each tributary, with wild and hatchery individuals treated as separate groups even if they were sampled in the same tributary. Sample sizes among run/tributary groups are comparable with most in the range of 20 - 30, though only seven Feather River Spring individuals are present.
The following parameters are compared:
For each parameter, significant heterogeneity among groups was determined using a Friedman’s test, which implements a repeated block design (alleles observed per locus present in each sample). Next, significance of pairwise comparisons was tested using a Wilcoxon signed rank test; p-values were adjusted assuming a FDR of 0.05. The use of the signed rank test gives “direction” to the test of significance, due to the block design, i.e. pairwise comparisons show that loci will consistently be lower (higher) across individuals as opposed to a test that does not account for the fact that the same blocks/loci are being observed (e.g. a t-test).
For Tajima’s D, additionally we generated a genome-wide null
distribution of Tajima’s \(D\) for a
set of neutral loci assuming mutation drift-equilibrium reflecting the
composition of the haplotyped empirical data set consisting of the same
number of loci with same distribution of segregating sites by simulating
loci under coalescence using MS (Hudson 2002) to compare to the
empirical data set.
Results of tests of significance for comparisons are presented as a heatmap indicating the test statistic of pairwise comparisons and printed p-values, significantly different comparisons (P < 0.05) is presented. Finally, the distribution of values across loci for individuals grouped by run or run/tributary are presented as boxplots and a table with summary statistics.
To compare whether identified loci are randomly distributed across chromosomes, null distributions were generated by shuffling chromosome designations across loci 1,000 times to determine whether the observed values fall outside the null distribution.
# define groups to analyze ====
setPop(gen) <- ~RUN_LOC
gen_grp <- seppop(gen)
gen_grp[["ALL"]] <- gen
# calculate diversity stats ====
loc_stats <- list()
for (p in names(gen_grp)) {
locA <- locus_table(gen_grp[[p]], index = "shannon") %>%
as.data.frame() %>%
rownames_to_column("LOCUS")
locB <- locus_table(gen_grp[[p]], index = "simpson") %>%
as.data.frame() %>%
rownames_to_column("LOCUS")
temp <- left_join(locA, locB)
locC <- locus_table(gen_grp[[p]], index = "invsimpson") %>%
as.data.frame() %>%
rownames_to_column("LOCUS")
loc_stats[[p]] <- left_join(temp, locC)
}
loc_stats <- ldply(loc_stats, data.frame) %>%
select(-Hexp) %>%
rename(GRP = `.id`,
SIMPSON_IDX = `X1.D`,
N_ALLELES = allele,
SHANNON_IDX = H,
STODD_TAYLOR_IDX = G,
EVENNESS = Evenness)
# calculate genetic diversity stats (heterozygosity-based) ====
loc_stats_2 <- list()
for (p in names(gen_grp)) {
dat <- hierfstat:::.genind2hierfstat(gen_grp[[p]])
stats <- basic.stats(dat)
loc_stats_2[[p]] <- stats$perloc %>%
rownames_to_column("LOCUS")
}
# combine into single data frame ====
loc_stats_2 <- ldply(loc_stats_2, data.frame) %>%
rename(GRP = `.id`)
loc_stats <- left_join(loc_stats, loc_stats_2) %>%
select(GRP, LOCUS, N_ALLELES, EVENNESS, Ho, Hs, Ht, Fis, SHANNON_IDX, SIMPSON_IDX, STODD_TAYLOR_IDX) %>%
filter(LOCUS != "mean")
loc_stats[is.na(loc_stats)] <- NA
write_delim(loc_stats, "results/gendiv.locstats", delim = "\t")
The observed heterozygosity (Ho) is measured as the proportion of heterozygote genotypes per locus (Nei 1987).
Compare distribution of observed heterozygosity within and among runs
Test of significant differences among runs using Friedman’s test.
##
## Friedman rank sum test
##
## data: Ho and GRP and LOCUS
## Friedman chi-squared = 3891.6, df = 16, p-value < 0.00000000000000022
Test for significant pairwise differences between tributaries within runs using Wilcoxon signed rank test.
# remove loci with Na values
rm <- het %>%
filter(is.na(Ho))
het <- het %>%
filter(!LOCUS %in% rm$LOCUS)
# groups to compare
comp <- as.character(unique(het$GRP))
# pairs of comparisons
pairs <- expand.grid(comp, comp) %>%
filter(!Var1 == Var2) %>%
rownames_to_column("PAIR") %>%
split(.$PAIR) %>%
purrr::map(function(x){
x %>%
select(-PAIR) %>%
gather(key = temp, value = GRP, 1:2) %>%
select(-temp)
})
# empty data frame for results
results <- setNames(data.frame(matrix(ncol = 4, nrow = 0)),
c("pop1", "pop2", "stat", "p.value")) %>%
mutate(pop1 = as.character(pop1),
pop2 = as.character(pop2),
stat = as.numeric(stat),
p.value = as.numeric(p.value))
n <- as.numeric(length(pairs))
# loop over pairs
for(i in 1:length(pairs)){
p <- i
pair <- pairs[[p]]$GRP
temp <- het %>%
filter(GRP %in% pair) %>%
mutate(GRP = ordered(GRP, levels = pair),
LOCUS = as.factor(LOCUS)) %>%
droplevels()
wilcox <- wilcoxsign_test(Ho ~ GRP | LOCUS,
data = temp,
zero.method = "Pratt")
df <- data.frame("pop1" = pair[1],
"pop2" = pair[2],
"stat" = as.numeric(wilcox@statistic@teststatistic),
"p-value" = as.numeric(pvalue(wilcox)))
results <- bind_rows(results, df)
}
results <- results %>%
mutate(pop1 = ordered(pop1, levels = trib_runs),
pop2 = ordered(pop2, levels = trib_runs))
write_delim(results, "results/obshet_run_trib.wilcox", delim = "\t")
Identify pairwise significantly different distribution between run/trib groups.
Results pairwise comparisons of observed heterozygosity among tributaries within runs using Wilcoxon signed rank test.
Compare distributions.
Distribution of observed heterozygosity of individuals grouped by tributary within each run.
Significantly different pairwise comparisons of observedheterozygosity across loci for individuals grouped by tributaries withinruns.
| pop1 | pop2 | stat | p.value | p_adj |
|---|---|---|---|---|
| F_FRH | W_USR | 39.1048 | 0.0000 | 0.0000 |
| F_COL | W_USR | 38.6791 | 0.0000 | 0.0000 |
| F_MER | W_USR | 35.3954 | 0.0000 | 0.0000 |
| F_NIM | W_USR | 35.3437 | 0.0000 | 0.0000 |
| S_DER | W_USR | 35.3418 | 0.0000 | 0.0000 |
| F_MKH | W_USR | 35.1831 | 0.0000 | 0.0000 |
| F_MIL | W_USR | 34.3626 | 0.0000 | 0.0000 |
| F_STN | W_USR | 34.3477 | 0.0000 | 0.0000 |
| F_TOU | W_USR | 34.3074 | 0.0000 | 0.0000 |
| L_USR | W_USR | 33.1392 | 0.0000 | 0.0000 |
| F_BUT | W_USR | 32.4029 | 0.0000 | 0.0000 |
| F_DER | W_USR | 31.6226 | 0.0000 | 0.0000 |
| S_MIL | W_USR | 29.6530 | 0.0000 | 0.0000 |
| F_MRH | W_USR | 29.3136 | 0.0000 | 0.0000 |
| S_BUT | W_USR | 26.4736 | 0.0000 | 0.0000 |
| S_FRH | W_USR | 23.0823 | 0.0000 | 0.0000 |
| F_FRH | S_BUT | 16.1773 | 0.0000 | 0.0000 |
| F_COL | S_BUT | 14.3671 | 0.0000 | 0.0000 |
| F_MIL | S_BUT | 11.4354 | 0.0000 | 0.0000 |
| F_MKH | S_BUT | 11.3978 | 0.0000 | 0.0000 |
| F_FRH | F_BUT | 11.3174 | 0.0000 | 0.0000 |
| F_NIM | S_BUT | 11.2935 | 0.0000 | 0.0000 |
| F_FRH | S_MIL | 11.2336 | 0.0000 | 0.0000 |
| F_FRH | S_FRH | 10.8857 | 0.0000 | 0.0000 |
| F_FRH | L_USR | 10.6817 | 0.0000 | 0.0000 |
| F_FRH | F_MRH | 10.6776 | 0.0000 | 0.0000 |
| F_MER | S_BUT | 10.5184 | 0.0000 | 0.0000 |
| F_FRH | F_TOU | 10.3985 | 0.0000 | 0.0000 |
| F_FRH | F_MER | 10.0416 | 0.0000 | 0.0000 |
| F_FRH | F_DER | 9.5463 | 0.0000 | 0.0000 |
| F_TOU | S_BUT | 9.5314 | 0.0000 | 0.0000 |
| F_COL | F_BUT | 9.2579 | 0.0000 | 0.0000 |
| F_FRH | S_DER | 8.8955 | 0.0000 | 0.0000 |
| F_FRH | F_NIM | 8.7359 | 0.0000 | 0.0000 |
| F_COL | L_USR | 8.5759 | 0.0000 | 0.0000 |
| F_FRH | F_MKH | 8.5154 | 0.0000 | 0.0000 |
| F_DER | S_BUT | 8.4337 | 0.0000 | 0.0000 |
| F_COL | S_FRH | 8.3184 | 0.0000 | 0.0000 |
| F_COL | S_MIL | 8.0711 | 0.0000 | 0.0000 |
| S_DER | S_BUT | 7.6913 | 0.0000 | 0.0000 |
| F_STN | S_BUT | 7.6438 | 0.0000 | 0.0000 |
| F_FRH | F_STN | 7.5029 | 0.0000 | 0.0000 |
| F_COL | S_DER | 7.2397 | 0.0000 | 0.0000 |
| F_COL | F_MRH | 7.1334 | 0.0000 | 0.0000 |
| F_MIL | F_BUT | 7.0241 | 0.0000 | 0.0000 |
| F_MKH | S_FRH | 6.8937 | 0.0000 | 0.0000 |
| S_MIL | S_BUT | 6.7613 | 0.0000 | 0.0000 |
| F_MRH | S_BUT | 6.7527 | 0.0000 | 0.0000 |
| F_NIM | F_BUT | 6.4274 | 0.0000 | 0.0000 |
| F_MKH | S_MIL | 6.2274 | 0.0000 | 0.0000 |
| F_BUT | S_BUT | 6.0031 | 0.0000 | 0.0000 |
| F_MKH | F_BUT | 6.0024 | 0.0000 | 0.0000 |
| F_MIL | L_USR | 5.9055 | 0.0000 | 0.0000 |
| F_NIM | S_FRH | 5.8448 | 0.0000 | 0.0000 |
| F_NIM | S_MIL | 5.5633 | 0.0000 | 0.0000 |
| F_COL | F_DER | 5.5048 | 0.0000 | 0.0000 |
| F_COL | F_STN | 5.4600 | 0.0000 | 0.0000 |
| L_USR | S_BUT | 5.2146 | 0.0000 | 0.0000 |
| F_COL | F_TOU | 5.1889 | 0.0000 | 0.0000 |
| F_NIM | L_USR | 5.1206 | 0.0000 | 0.0000 |
| F_FRH | F_MIL | 5.0978 | 0.0000 | 0.0000 |
| F_FRH | F_COL | 5.0077 | 0.0000 | 0.0000 |
| F_MER | S_MIL | 4.9342 | 0.0000 | 0.0000 |
| F_MER | S_FRH | 4.8220 | 0.0000 | 0.0000 |
| F_NIM | F_MRH | 4.8102 | 0.0000 | 0.0000 |
| F_MIL | S_FRH | 4.8086 | 0.0000 | 0.0000 |
| F_MIL | S_DER | 4.7729 | 0.0000 | 0.0000 |
| F_MKH | L_USR | 4.6296 | 0.0000 | 0.0000 |
| F_TOU | S_FRH | 4.5094 | 0.0000 | 0.0000 |
| F_MIL | F_STN | 4.3115 | 0.0000 | 0.0000 |
| F_NIM | S_DER | 4.3009 | 0.0000 | 0.0000 |
| F_COL | F_MER | 4.1906 | 0.0000 | 0.0000 |
| F_MER | F_BUT | 4.1713 | 0.0000 | 0.0000 |
| F_MKH | F_MRH | 4.0767 | 0.0000 | 0.0000 |
| F_MKH | F_DER | 3.9318 | 0.0001 | 0.0002 |
| F_TOU | S_MIL | 3.9263 | 0.0001 | 0.0002 |
| F_STN | S_FRH | 3.9094 | 0.0001 | 0.0002 |
| F_MIL | S_MIL | 3.8983 | 0.0001 | 0.0002 |
| F_MER | L_USR | 3.8340 | 0.0001 | 0.0002 |
| F_DER | F_BUT | 3.8071 | 0.0001 | 0.0002 |
| F_MKH | F_MER | 3.7969 | 0.0001 | 0.0002 |
| F_STN | S_DER | 3.6471 | 0.0003 | 0.0005 |
| F_COL | F_MIL | 3.5339 | 0.0004 | 0.0007 |
| F_MKH | S_DER | 3.4436 | 0.0006 | 0.0010 |
| L_USR | S_FRH | 3.3424 | 0.0008 | 0.0013 |
| F_NIM | F_DER | 3.2529 | 0.0011 | 0.0017 |
| F_MIL | F_MRH | 3.2057 | 0.0013 | 0.0020 |
| S_FRH | S_BUT | 3.2008 | 0.0014 | 0.0022 |
| F_DER | S_MIL | 3.0998 | 0.0019 | 0.0029 |
| F_MER | F_MRH | 2.9651 | 0.0030 | 0.0045 |
| F_MKH | F_TOU | 2.9566 | 0.0031 | 0.0046 |
| F_MER | S_DER | 2.9367 | 0.0033 | 0.0049 |
| F_TOU | F_BUT | 2.9209 | 0.0035 | 0.0051 |
| F_MIL | F_TOU | 2.7212 | 0.0065 | 0.0094 |
| F_DER | L_USR | 2.6895 | 0.0072 | 0.0103 |
| S_DER | S_FRH | 2.6374 | 0.0084 | 0.0119 |
| F_STN | S_MIL | 2.6302 | 0.0085 | 0.0119 |
| F_STN | L_USR | 2.6288 | 0.0086 | 0.0119 |
| F_NIM | F_MER | 2.5304 | 0.0114 | 0.0157 |
| F_MRH | S_MIL | 2.5200 | 0.0117 | 0.0159 |
| F_NIM | F_STN | 2.4878 | 0.0129 | 0.0174 |
| F_MKH | F_NIM | 2.4789 | 0.0132 | 0.0176 |
| F_TOU | F_MRH | 2.3624 | 0.0182 | 0.0240 |
| F_BUT | S_FRH | 2.2552 | 0.0241 | 0.0315 |
| F_BUT | L_USR | 2.2400 | 0.0251 | 0.0325 |
| F_STN | F_MRH | 2.2356 | 0.0254 | 0.0326 |
| F_TOU | L_USR | 2.2095 | 0.0271 | 0.0344 |
Table S3: Mean, standard deviation, median, 25th, and 75thquartile of observed heterozygosity across loci for individuals groupedby run type & tributary.
| GRP | mean | sd | median | Q1 | Q3 |
|---|---|---|---|---|---|
| F_COL | 0.1713 | 0.1886 | 0.1000 | 0 | 0.3000 |
| F_MIL | 0.1699 | 0.1922 | 0.1000 | 0 | 0.2941 |
| F_DER | 0.1674 | 0.1939 | 0.0714 | 0 | 0.2857 |
| F_BUT | 0.1641 | 0.1884 | 0.1000 | 0 | 0.2778 |
| F_FRH | 0.1741 | 0.1908 | 0.1111 | 0 | 0.2963 |
| F_NIM | 0.1688 | 0.1874 | 0.1000 | 0 | 0.3000 |
| F_MKH | 0.1689 | 0.1884 | 0.0800 | 0 | 0.2917 |
| F_STN | 0.1673 | 0.1905 | 0.0909 | 0 | 0.2857 |
| F_TOU | 0.1659 | 0.1849 | 0.0833 | 0 | 0.2857 |
| F_MRH | 0.1658 | 0.1938 | 0.0769 | 0 | 0.2857 |
| F_MER | 0.1670 | 0.1846 | 0.1000 | 0 | 0.2903 |
| L_USR | 0.1654 | 0.1888 | 0.0952 | 0 | 0.2857 |
| W_USR | 0.1272 | 0.1840 | 0.0385 | 0 | 0.2308 |
| S_MIL | 0.1651 | 0.1932 | 0.0714 | 0 | 0.2857 |
| S_DER | 0.1628 | 0.1802 | 0.0870 | 0 | 0.2800 |
| S_BUT | 0.1587 | 0.1888 | 0.0588 | 0 | 0.2778 |
| S_FRH | 0.1695 | 0.2157 | 0.1429 | 0 | 0.2857 |
Expected heterozygosity Hs is calculated as the proportion of heterozygous genotypes expected under Hardy-Weinberg equilibrium (Nei 1987).
Compare distribution of gene diversity within and among runs
Test of significant differences among runs using Friedman’s test.
##
## Friedman rank sum test
##
## data: Hs and GRP and LOCUS
## Friedman chi-squared = 3421.1, df = 16, p-value < 0.00000000000000022
Test for significant pairwise differences between tributaries within runs using Wilcoxon signed rank test.
Results pairwise comparisons of expected heterozygosity among tributaries within runs using Wilcoxon signed rank test.
Compare distributions across groups.
Distribution of expected heterozygosity per locus for individuals grouped by tributary within each run.
Significance of pairwise comparisons of expected heterozygosityfor individuals grouped by tributaries within runs.
| pop1 | pop2 | stat | p.value | p_adj |
|---|---|---|---|---|
| F_FRH | W_USR | 39.2049 | 0.0000 | 0.0000 |
| S_DER | W_USR | 39.1176 | 0.0000 | 0.0000 |
| F_COL | W_USR | 38.1169 | 0.0000 | 0.0000 |
| F_MIL | W_USR | 36.5044 | 0.0000 | 0.0000 |
| F_MER | W_USR | 35.9299 | 0.0000 | 0.0000 |
| F_NIM | W_USR | 35.4636 | 0.0000 | 0.0000 |
| F_TOU | W_USR | 35.2467 | 0.0000 | 0.0000 |
| F_STN | W_USR | 35.0595 | 0.0000 | 0.0000 |
| F_MKH | W_USR | 35.0166 | 0.0000 | 0.0000 |
| L_USR | W_USR | 34.4389 | 0.0000 | 0.0000 |
| F_BUT | W_USR | 33.8632 | 0.0000 | 0.0000 |
| S_MIL | W_USR | 33.7289 | 0.0000 | 0.0000 |
| F_DER | W_USR | 33.4003 | 0.0000 | 0.0000 |
| F_MRH | W_USR | 32.8072 | 0.0000 | 0.0000 |
| S_BUT | W_USR | 29.6358 | 0.0000 | 0.0000 |
| S_FRH | W_USR | 25.8343 | 0.0000 | 0.0000 |
| F_FRH | S_BUT | 10.7653 | 0.0000 | 0.0000 |
| F_MIL | S_BUT | 9.3655 | 0.0000 | 0.0000 |
| S_DER | S_BUT | 9.2576 | 0.0000 | 0.0000 |
| F_COL | S_BUT | 9.1696 | 0.0000 | 0.0000 |
| F_FRH | F_MER | 8.8120 | 0.0000 | 0.0000 |
| F_FRH | F_NIM | 8.5196 | 0.0000 | 0.0000 |
| F_FRH | F_MKH | 8.2886 | 0.0000 | 0.0000 |
| F_MIL | F_BUT | 8.0117 | 0.0000 | 0.0000 |
| F_FRH | F_BUT | 7.8149 | 0.0000 | 0.0000 |
| S_MIL | S_BUT | 7.6811 | 0.0000 | 0.0000 |
| F_FRH | L_USR | 7.6379 | 0.0000 | 0.0000 |
| F_MIL | L_USR | 7.5210 | 0.0000 | 0.0000 |
| F_FRH | F_TOU | 7.4032 | 0.0000 | 0.0000 |
| F_MIL | F_STN | 7.1562 | 0.0000 | 0.0000 |
| F_NIM | S_BUT | 7.0782 | 0.0000 | 0.0000 |
| F_MER | S_BUT | 6.8788 | 0.0000 | 0.0000 |
| F_MRH | S_BUT | 6.6184 | 0.0000 | 0.0000 |
| F_MKH | S_BUT | 6.4468 | 0.0000 | 0.0000 |
| F_DER | S_BUT | 6.3022 | 0.0000 | 0.0000 |
| F_TOU | S_BUT | 6.2453 | 0.0000 | 0.0000 |
| F_FRH | F_DER | 6.2127 | 0.0000 | 0.0000 |
| F_FRH | F_STN | 6.0891 | 0.0000 | 0.0000 |
| F_FRH | S_FRH | 5.9649 | 0.0000 | 0.0000 |
| F_COL | L_USR | 5.6659 | 0.0000 | 0.0000 |
| F_FRH | S_MIL | 5.4279 | 0.0000 | 0.0000 |
| F_FRH | F_COL | 5.3574 | 0.0000 | 0.0000 |
| F_MIL | F_TOU | 5.0770 | 0.0000 | 0.0000 |
| F_COL | F_BUT | 4.7236 | 0.0000 | 0.0000 |
| F_MIL | F_MKH | 4.7030 | 0.0000 | 0.0000 |
| F_MIL | F_MER | 4.5026 | 0.0000 | 0.0000 |
| S_DER | F_TOU | 4.4197 | 0.0000 | 0.0000 |
| F_MIL | F_NIM | 4.3035 | 0.0000 | 0.0000 |
| S_DER | S_FRH | 4.2387 | 0.0000 | 0.0000 |
| F_BUT | S_BUT | 4.2346 | 0.0000 | 0.0000 |
| F_FRH | F_MRH | 4.1109 | 0.0000 | 0.0000 |
| S_DER | F_MKH | 4.0649 | 0.0000 | 0.0000 |
| F_MIL | S_FRH | 3.8499 | 0.0001 | 0.0003 |
| F_STN | S_BUT | 3.7820 | 0.0002 | 0.0005 |
| F_MRH | F_BUT | 3.7347 | 0.0002 | 0.0005 |
| S_DER | F_MER | 3.7025 | 0.0002 | 0.0005 |
| F_COL | S_FRH | 3.6908 | 0.0002 | 0.0005 |
| F_DER | L_USR | 3.5925 | 0.0003 | 0.0007 |
| S_DER | F_BUT | 3.5784 | 0.0003 | 0.0007 |
| S_DER | F_NIM | 3.4957 | 0.0005 | 0.0011 |
| F_COL | F_STN | 3.4423 | 0.0006 | 0.0013 |
| F_DER | F_BUT | 3.3728 | 0.0007 | 0.0015 |
| F_MRH | F_DER | 3.2570 | 0.0011 | 0.0024 |
| F_MRH | L_USR | 3.1089 | 0.0019 | 0.0040 |
| S_DER | L_USR | 2.9818 | 0.0029 | 0.0061 |
| L_USR | S_BUT | 2.8755 | 0.0040 | 0.0082 |
| F_MKH | F_MER | 2.7790 | 0.0055 | 0.0112 |
| F_MKH | F_NIM | 2.7182 | 0.0066 | 0.0132 |
| S_MIL | F_BUT | 2.7065 | 0.0068 | 0.0134 |
| F_BUT | L_USR | 2.6942 | 0.0071 | 0.0136 |
| F_MIL | F_DER | 2.6926 | 0.0071 | 0.0136 |
| F_NIM | L_USR | 2.6676 | 0.0076 | 0.0144 |
| S_DER | F_DER | 2.6247 | 0.0087 | 0.0162 |
| S_MIL | L_USR | 2.5775 | 0.0100 | 0.0184 |
| F_MKH | L_USR | 2.5647 | 0.0103 | 0.0187 |
| F_MER | L_USR | 2.5351 | 0.0112 | 0.0200 |
| F_COL | F_MER | 2.4938 | 0.0126 | 0.0221 |
| S_DER | F_STN | 2.4919 | 0.0127 | 0.0221 |
| F_NIM | F_BUT | 2.4844 | 0.0130 | 0.0224 |
| F_MER | F_BUT | 2.4674 | 0.0136 | 0.0231 |
| S_FRH | S_BUT | 2.3866 | 0.0170 | 0.0285 |
| F_NIM | S_FRH | 2.2779 | 0.0227 | 0.0376 |
| F_MKH | F_BUT | 2.2373 | 0.0253 | 0.0415 |
| S_DER | S_MIL | 2.1591 | 0.0308 | 0.0499 |
Table S4: Mean, standard deviation, median, 25th, and 75thquartile of observed heterozygosity per locus for individuals grouped byrun type & tributary.
| GRP | mean | sd | median | Q1 | Q3 |
|---|---|---|---|---|---|
| F_COL | 0.1693 | 0.1798 | 0.0966 | 0 | 0.3034 |
| F_MIL | 0.1714 | 0.1845 | 0.1078 | 0 | 0.3125 |
| F_DER | 0.1685 | 0.1850 | 0.0769 | 0 | 0.3022 |
| F_BUT | 0.1662 | 0.1827 | 0.0974 | 0 | 0.2974 |
| F_FRH | 0.1718 | 0.1810 | 0.1068 | 0 | 0.3063 |
| F_NIM | 0.1678 | 0.1804 | 0.0966 | 0 | 0.3034 |
| F_MKH | 0.1680 | 0.1810 | 0.1032 | 0 | 0.3003 |
| F_STN | 0.1670 | 0.1827 | 0.0929 | 0 | 0.3030 |
| F_TOU | 0.1676 | 0.1807 | 0.0998 | 0 | 0.3014 |
| F_MRH | 0.1695 | 0.1870 | 0.0833 | 0 | 0.3077 |
| F_MER | 0.1676 | 0.1799 | 0.0966 | 0 | 0.2960 |
| L_USR | 0.1668 | 0.1831 | 0.0929 | 0 | 0.3000 |
| W_USR | 0.1285 | 0.1789 | 0.0385 | 0 | 0.2369 |
| S_MIL | 0.1695 | 0.1877 | 0.0769 | 0 | 0.3125 |
| S_DER | 0.1699 | 0.1808 | 0.1068 | 0 | 0.3030 |
| S_BUT | 0.1642 | 0.1874 | 0.1023 | 0 | 0.3070 |
| S_FRH | 0.1695 | 0.1983 | 0.1429 | 0 | 0.3571 |
The inbreeding coefficient Fis is calculated as 1-Ho/He (Weir and Cockerham 1984) and indicates heterozygote deficiencies compared to expectations under Hardy-Weinberg Equilibrium, i.e. it is a measure of deviation from panmixia at local scales.
Compare distribution of heterozygosity deficiency (Fis) within and among runs
Test of significant differences among runs using Friedman’s test.
##
## Friedman rank sum test
##
## data: Fis and GRP and LOCUS
## Friedman chi-squared = 382.82, df = 16, p-value < 0.00000000000000022
Test for significant pairwise differences between tributaries within runs using Wilcoxon signed rank test.
Results pairwise comparisons of Fis among tributaries within runs using Wilcoxon signed rank test.
Compare distributions across groups.
Distribution of Fis per locus for individuals grouped by tributary within each run.
Significance of pairwise comparisons of Fis for individualsgrouped by tributaries within runs.
| pop1 | pop2 | stat | p.value | p_adj |
|---|---|---|---|---|
| S_DER | F_FRH | 12.9986 | 0.0000 | 0.0000 |
| S_DER | F_COL | 12.3879 | 0.0000 | 0.0000 |
| S_DER | F_MKH | 11.6782 | 0.0000 | 0.0000 |
| S_DER | F_NIM | 11.2995 | 0.0000 | 0.0000 |
| S_DER | S_FRH | 10.4531 | 0.0000 | 0.0000 |
| S_DER | F_STN | 9.7002 | 0.0000 | 0.0000 |
| S_BUT | F_FRH | 9.3555 | 0.0000 | 0.0000 |
| S_DER | F_DER | 8.9582 | 0.0000 | 0.0000 |
| S_DER | F_MER | 8.9367 | 0.0000 | 0.0000 |
| S_BUT | F_COL | 8.8651 | 0.0000 | 0.0000 |
| S_DER | F_MIL | 8.5151 | 0.0000 | 0.0000 |
| S_BUT | S_FRH | 8.5124 | 0.0000 | 0.0000 |
| S_DER | L_USR | 8.0760 | 0.0000 | 0.0000 |
| S_BUT | F_NIM | 7.9226 | 0.0000 | 0.0000 |
| S_BUT | F_MKH | 7.9064 | 0.0000 | 0.0000 |
| S_DER | F_TOU | 7.1567 | 0.0000 | 0.0000 |
| W_USR | S_FRH | 7.0839 | 0.0000 | 0.0000 |
| W_USR | F_FRH | 6.8825 | 0.0000 | 0.0000 |
| S_BUT | F_STN | 6.8440 | 0.0000 | 0.0000 |
| S_MIL | F_FRH | 6.7485 | 0.0000 | 0.0000 |
| S_DER | F_BUT | 6.5330 | 0.0000 | 0.0000 |
| W_USR | F_COL | 6.5018 | 0.0000 | 0.0000 |
| S_BUT | F_DER | 6.2774 | 0.0000 | 0.0000 |
| F_TOU | F_FRH | 6.2393 | 0.0000 | 0.0000 |
| S_MIL | S_FRH | 6.1885 | 0.0000 | 0.0000 |
| S_MIL | F_COL | 6.1795 | 0.0000 | 0.0000 |
| F_TOU | S_FRH | 5.9771 | 0.0000 | 0.0000 |
| S_DER | W_USR | 5.9011 | 0.0000 | 0.0000 |
| S_BUT | F_MIL | 5.8820 | 0.0000 | 0.0000 |
| F_TOU | F_COL | 5.8464 | 0.0000 | 0.0000 |
| S_BUT | F_MER | 5.8370 | 0.0000 | 0.0000 |
| S_DER | F_MRH | 5.7265 | 0.0000 | 0.0000 |
| F_MRH | S_FRH | 5.5541 | 0.0000 | 0.0000 |
| F_BUT | S_FRH | 5.4888 | 0.0000 | 0.0000 |
| S_MIL | F_MKH | 5.3953 | 0.0000 | 0.0000 |
| F_MRH | F_FRH | 5.3218 | 0.0000 | 0.0000 |
| F_BUT | F_FRH | 5.2537 | 0.0000 | 0.0000 |
| W_USR | F_NIM | 5.0250 | 0.0000 | 0.0000 |
| F_BUT | F_COL | 5.0042 | 0.0000 | 0.0000 |
| L_USR | F_FRH | 4.9626 | 0.0000 | 0.0000 |
| S_MIL | F_NIM | 4.8518 | 0.0000 | 0.0000 |
| W_USR | F_MKH | 4.7828 | 0.0000 | 0.0000 |
| S_BUT | L_USR | 4.7529 | 0.0000 | 0.0000 |
| F_MRH | F_COL | 4.7432 | 0.0000 | 0.0000 |
| L_USR | S_FRH | 4.6219 | 0.0000 | 0.0000 |
| S_MIL | F_DER | 4.4091 | 0.0000 | 0.0000 |
| F_MER | S_FRH | 4.3662 | 0.0000 | 0.0000 |
| F_BUT | F_MKH | 4.3205 | 0.0000 | 0.0000 |
| S_MIL | F_STN | 4.2781 | 0.0000 | 0.0000 |
| F_TOU | F_MKH | 4.2429 | 0.0000 | 0.0000 |
| F_MRH | F_MKH | 4.2335 | 0.0000 | 0.0000 |
| F_MIL | S_FRH | 4.2321 | 0.0000 | 0.0000 |
| W_USR | F_STN | 4.1502 | 0.0000 | 0.0000 |
| L_USR | F_COL | 4.1256 | 0.0000 | 0.0000 |
| F_TOU | F_NIM | 4.0931 | 0.0000 | 0.0000 |
| S_BUT | F_TOU | 3.9108 | 0.0001 | 0.0002 |
| S_DER | S_MIL | 3.8990 | 0.0001 | 0.0002 |
| S_BUT | F_BUT | 3.8714 | 0.0001 | 0.0002 |
| F_MER | F_FRH | 3.8639 | 0.0001 | 0.0002 |
| W_USR | F_DER | 3.6270 | 0.0003 | 0.0006 |
| S_MIL | F_MIL | 3.6189 | 0.0003 | 0.0006 |
| S_MIL | F_MER | 3.6182 | 0.0003 | 0.0006 |
| F_NIM | S_FRH | 3.5938 | 0.0003 | 0.0006 |
| F_BUT | F_NIM | 3.5459 | 0.0004 | 0.0008 |
| S_BUT | F_MRH | 3.5442 | 0.0004 | 0.0008 |
| F_MER | F_COL | 3.5146 | 0.0004 | 0.0008 |
| F_MRH | F_NIM | 3.5124 | 0.0004 | 0.0008 |
| F_STN | S_FRH | 3.4311 | 0.0006 | 0.0012 |
| F_DER | S_FRH | 3.3807 | 0.0007 | 0.0014 |
| F_TOU | F_STN | 3.3040 | 0.0010 | 0.0019 |
| F_TOU | F_DER | 3.1047 | 0.0019 | 0.0036 |
| F_MRH | F_DER | 3.0935 | 0.0020 | 0.0038 |
| W_USR | F_MIL | 3.0501 | 0.0023 | 0.0043 |
| F_MRH | F_STN | 2.9900 | 0.0028 | 0.0051 |
| W_USR | F_MER | 2.9355 | 0.0033 | 0.0060 |
| S_DER | S_BUT | 2.9150 | 0.0036 | 0.0064 |
| F_MIL | F_FRH | 2.8198 | 0.0048 | 0.0085 |
| L_USR | F_NIM | 2.7748 | 0.0055 | 0.0096 |
| L_USR | F_MKH | 2.7127 | 0.0067 | 0.0115 |
| S_MIL | L_USR | 2.6934 | 0.0071 | 0.0121 |
| F_BUT | F_STN | 2.6779 | 0.0074 | 0.0124 |
| F_MIL | F_COL | 2.6583 | 0.0079 | 0.0131 |
| F_MKH | S_FRH | 2.6466 | 0.0081 | 0.0133 |
| S_BUT | W_USR | 2.5876 | 0.0097 | 0.0157 |
| F_NIM | F_FRH | 2.5694 | 0.0102 | 0.0163 |
| F_STN | F_FRH | 2.5558 | 0.0106 | 0.0168 |
| F_MER | F_MKH | 2.5383 | 0.0111 | 0.0173 |
| F_BUT | F_DER | 2.5360 | 0.0112 | 0.0173 |
| F_DER | F_FRH | 2.4110 | 0.0159 | 0.0243 |
| F_TOU | F_MER | 2.2663 | 0.0234 | 0.0354 |
| F_MRH | F_MIL | 2.2467 | 0.0247 | 0.0369 |
| L_USR | F_DER | 2.1934 | 0.0283 | 0.0414 |
| F_BUT | F_MIL | 2.1930 | 0.0283 | 0.0414 |
| F_COL | S_FRH | 2.1666 | 0.0303 | 0.0438 |
| L_USR | F_STN | 2.1160 | 0.0343 | 0.0491 |
Pairwise comparisons of Fis that are not significantlydifferent for individuals grouped by tributaries within runs.
| pop1 | pop2 | stat | p.value | p_adj |
|---|---|---|---|---|
| F_BUT | F_MER | 2.0850 | 0.0371 | 0.0526 |
| F_MRH | F_MER | 2.0742 | 0.0381 | 0.0534 |
| F_TOU | F_MIL | 2.0110 | 0.0443 | 0.0615 |
| S_MIL | F_TOU | 1.9924 | 0.0463 | 0.0636 |
| W_USR | L_USR | 1.9798 | 0.0477 | 0.0649 |
| F_DER | F_COL | 1.9702 | 0.0488 | 0.0657 |
| F_MKH | F_FRH | 1.9402 | 0.0524 | 0.0699 |
| F_FRH | S_FRH | 1.7935 | 0.0729 | 0.0963 |
| F_STN | F_COL | 1.7595 | 0.0785 | 0.1027 |
| F_NIM | F_COL | 1.7229 | 0.0849 | 0.1100 |
| F_MIL | F_MKH | 1.7087 | 0.0875 | 0.1123 |
| S_MIL | F_BUT | 1.6997 | 0.0892 | 0.1134 |
| S_BUT | S_MIL | 1.6109 | 0.1072 | 0.1350 |
| F_MER | F_NIM | 1.6052 | 0.1084 | 0.1353 |
| W_USR | F_BUT | 1.4768 | 0.1397 | 0.1727 |
| F_MRH | L_USR | 1.3439 | 0.1790 | 0.2193 |
| S_MIL | F_MRH | 1.2944 | 0.1955 | 0.2374 |
| F_MKH | F_COL | 1.2105 | 0.2261 | 0.2721 |
| F_MIL | F_NIM | 1.1771 | 0.2392 | 0.2854 |
| F_MER | F_DER | 1.0953 | 0.2734 | 0.3233 |
| F_TOU | L_USR | 1.0781 | 0.2810 | 0.3294 |
| F_MRH | F_BUT | 0.9302 | 0.3523 | 0.4095 |
| L_USR | F_MIL | 0.9056 | 0.3652 | 0.4209 |
| F_MER | F_STN | 0.8594 | 0.3901 | 0.4458 |
| F_DER | F_MKH | 0.8017 | 0.4227 | 0.4791 |
| W_USR | F_TOU | 0.7884 | 0.4305 | 0.4835 |
| F_NIM | F_MKH | 0.7824 | 0.4340 | 0.4835 |
| S_MIL | W_USR | 0.7767 | 0.4373 | 0.4835 |
| F_STN | F_MKH | 0.7604 | 0.4470 | 0.4888 |
| L_USR | F_MER | 0.7566 | 0.4493 | 0.4888 |
| F_STN | F_NIM | 0.7182 | 0.4726 | 0.5101 |
| W_USR | F_MRH | 0.7049 | 0.4809 | 0.5150 |
| F_MIL | F_STN | 0.6580 | 0.5106 | 0.5425 |
| F_BUT | L_USR | 0.5643 | 0.5725 | 0.6036 |
| F_MER | F_MIL | 0.5042 | 0.6141 | 0.6424 |
| F_DER | F_NIM | 0.4850 | 0.6277 | 0.6517 |
| F_MIL | F_DER | 0.3097 | 0.7568 | 0.7797 |
| F_COL | F_FRH | 0.2366 | 0.8130 | 0.8313 |
| F_MRH | F_TOU | 0.1503 | 0.8805 | 0.8936 |
| F_TOU | F_BUT | 0.1394 | 0.8891 | 0.8957 |
| F_DER | F_STN | 0.0867 | 0.9309 | 0.9309 |
Table S5: Mean, standard deviation, median, 25th, and 75thquartile of Fis per locus for individuals grouped by run type &tributary.
| GRP | mean | sd | median | Q1 | Q3 |
|---|---|---|---|---|---|
| F_COL | -0.0171 | 0.1823 | -0.0404 | -0.1373 | 0.1002 |
| F_MIL | -0.0010 | 0.2328 | -0.0370 | -0.1515 | 0.1387 |
| F_DER | -0.0020 | 0.2470 | -0.0400 | -0.1667 | 0.1340 |
| F_BUT | 0.0077 | 0.2217 | -0.0286 | -0.1333 | 0.1461 |
| F_FRH | -0.0194 | 0.1889 | -0.0417 | -0.1429 | 0.0979 |
| F_NIM | -0.0100 | 0.1847 | -0.0370 | -0.1277 | 0.1018 |
| F_MKH | -0.0131 | 0.1835 | -0.0400 | -0.1304 | 0.0983 |
| F_STN | -0.0060 | 0.2100 | -0.0471 | -0.1429 | 0.1220 |
| F_TOU | 0.0064 | 0.1941 | -0.0216 | -0.1212 | 0.1253 |
| F_MRH | 0.0149 | 0.2630 | -0.0400 | -0.1507 | 0.1724 |
| F_MER | -0.0038 | 0.1820 | -0.0300 | -0.1200 | 0.1064 |
| L_USR | 0.0029 | 0.2144 | -0.0270 | -0.1429 | 0.1411 |
| W_USR | 0.0104 | 0.1989 | -0.0204 | -0.0870 | 0.0964 |
| S_MIL | 0.0205 | 0.2587 | -0.0345 | -0.1526 | 0.1875 |
| S_DER | 0.0370 | 0.2085 | 0.0000 | -0.1064 | 0.1732 |
| S_BUT | 0.0273 | 0.2273 | -0.0149 | -0.1250 | 0.1628 |
| S_FRH | -0.0141 | 0.3251 | -0.0909 | -0.2000 | 0.1429 |
The number of alleles observed per locus is not independent from sample size, especially for small samples sizes; with an increasing number of individuals sampled an increasing number of alleles is expected to be identified as rare alleles are more likely to be observed only for larger sample sizes. Differences in sample size can be adjusted for using rarefaction, i.e. by determining the lowest sample size and then subsampling x individuals from each group and counting alleles and using the mean number as the allele count.
Compare differences in rarefied allele counts among tributaries within runs
Calculate allele count per locus corrected for differences in sample size using rarefaction.
# by tributary within run
setPop(gen) <- ~RUN_LOC
# calculate allele counts using rarefaction
dat <- genind2hierfstat(gen)
df <- allelic.richness(dat,
diploid = TRUE)
df <- as.data.frame(df$Ar) %>%
rownames_to_column("LOCUS")
colnames(df) <- c("LOCUS", trib_runs)
write_delim(df, "results/trib_run_rarefied.allelecount", delim = "\t")
Test of significant differences among runs using Friedman’s test.
##
## Friedman rank sum test
##
## data: AR and pop and LOCUS
## Friedman chi-squared = 4236.7, df = 16, p-value < 0.00000000000000022
Test for significant pairwise differences between tributaries within runs using Wilcoxon signed rank test.
Significance of pairwise comparison of rarefied allele counts among runs using Wilcoxon signed rank test. Tile color indicates p-value, label is stat (indicates direction of difference); p-values were adjusted for multiple comparisons using Benjamini-Hochberg.
Compare distributions of values.
Distribution of rarefied allele counts per locus for individuals grouped by run type within tributaries.
Significance of pairwise comparisons of allele counts per locusfor individuals grouped by run type within tributaries.
| pop1 | pop2 | stat | p.value | p_adj |
|---|---|---|---|---|
| F_MIL | L_USR | 47.3156 | 0.0000 | 0.0000 |
| W_USR | L_USR | 46.7328 | 0.0000 | 0.0000 |
| F_MRH | L_USR | 46.4299 | 0.0000 | 0.0000 |
| F_FRH | L_USR | 44.2771 | 0.0000 | 0.0000 |
| F_STN | L_USR | 43.5133 | 0.0000 | 0.0000 |
| F_MKH | L_USR | 43.0950 | 0.0000 | 0.0000 |
| F_TOU | L_USR | 42.7212 | 0.0000 | 0.0000 |
| F_COL | L_USR | 42.6343 | 0.0000 | 0.0000 |
| F_MER | L_USR | 42.3246 | 0.0000 | 0.0000 |
| F_NIM | L_USR | 41.3909 | 0.0000 | 0.0000 |
| S_FRH | L_USR | 41.3626 | 0.0000 | 0.0000 |
| F_BUT | L_USR | 41.2526 | 0.0000 | 0.0000 |
| S_MIL | L_USR | 40.7391 | 0.0000 | 0.0000 |
| F_DER | L_USR | 40.5246 | 0.0000 | 0.0000 |
| S_BUT | L_USR | 35.6730 | 0.0000 | 0.0000 |
| S_DER | L_USR | 34.6154 | 0.0000 | 0.0000 |
| F_MIL | S_DER | 14.2654 | 0.0000 | 0.0000 |
| F_FRH | S_DER | 13.1600 | 0.0000 | 0.0000 |
| F_MRH | S_DER | 12.9707 | 0.0000 | 0.0000 |
| S_BUT | S_DER | 11.9349 | 0.0000 | 0.0000 |
| W_USR | S_DER | 11.6031 | 0.0000 | 0.0000 |
| F_TOU | S_DER | 10.5659 | 0.0000 | 0.0000 |
| S_MIL | S_DER | 10.5214 | 0.0000 | 0.0000 |
| F_BUT | S_DER | 10.3866 | 0.0000 | 0.0000 |
| F_DER | S_DER | 10.3305 | 0.0000 | 0.0000 |
| F_MIL | F_STN | 10.2502 | 0.0000 | 0.0000 |
| F_STN | S_DER | 10.2064 | 0.0000 | 0.0000 |
| F_FRH | S_FRH | 9.6038 | 0.0000 | 0.0000 |
| F_MIL | F_TOU | 9.4725 | 0.0000 | 0.0000 |
| F_COL | S_DER | 9.3000 | 0.0000 | 0.0000 |
| F_MIL | F_COL | 9.2255 | 0.0000 | 0.0000 |
| F_MKH | S_DER | 9.1481 | 0.0000 | 0.0000 |
| F_FRH | F_MER | 8.6395 | 0.0000 | 0.0000 |
| F_MIL | S_FRH | 8.4771 | 0.0000 | 0.0000 |
| F_FRH | F_NIM | 8.4586 | 0.0000 | 0.0000 |
| F_MIL | F_MKH | 8.1921 | 0.0000 | 0.0000 |
| S_BUT | S_FRH | 8.0404 | 0.0000 | 0.0000 |
| F_NIM | S_DER | 7.3800 | 0.0000 | 0.0000 |
| F_MER | S_DER | 7.1594 | 0.0000 | 0.0000 |
| F_FRH | F_MKH | 7.1041 | 0.0000 | 0.0000 |
| F_MIL | F_NIM | 6.9014 | 0.0000 | 0.0000 |
| S_BUT | F_NIM | 6.8211 | 0.0000 | 0.0000 |
| F_FRH | F_STN | 6.8165 | 0.0000 | 0.0000 |
| F_MRH | S_FRH | 6.7512 | 0.0000 | 0.0000 |
| S_BUT | F_MER | 6.5489 | 0.0000 | 0.0000 |
| S_BUT | S_MIL | 6.5485 | 0.0000 | 0.0000 |
| F_FRH | F_COL | 6.5399 | 0.0000 | 0.0000 |
| S_BUT | F_DER | 6.4543 | 0.0000 | 0.0000 |
| F_FRH | F_TOU | 6.4394 | 0.0000 | 0.0000 |
| F_BUT | S_FRH | 6.3824 | 0.0000 | 0.0000 |
| S_BUT | F_BUT | 6.0932 | 0.0000 | 0.0000 |
| F_MIL | F_MER | 6.0609 | 0.0000 | 0.0000 |
| F_MIL | F_MRH | 6.0357 | 0.0000 | 0.0000 |
| F_MIL | S_MIL | 5.6104 | 0.0000 | 0.0000 |
| F_DER | S_FRH | 5.3952 | 0.0000 | 0.0000 |
| F_DER | F_NIM | 5.3912 | 0.0000 | 0.0000 |
| F_BUT | F_NIM | 5.2145 | 0.0000 | 0.0000 |
| S_FRH | S_DER | 5.0933 | 0.0000 | 0.0000 |
| F_DER | F_COL | 4.7337 | 0.0000 | 0.0000 |
| S_BUT | F_COL | 4.6233 | 0.0000 | 0.0000 |
| F_DER | F_STN | 4.5255 | 0.0000 | 0.0000 |
| S_BUT | F_MKH | 4.4261 | 0.0000 | 0.0000 |
| W_USR | F_MKH | 4.4182 | 0.0000 | 0.0000 |
| W_USR | F_COL | 4.3170 | 0.0000 | 0.0000 |
| F_DER | F_MKH | 4.1916 | 0.0000 | 0.0000 |
| F_MRH | F_NIM | 4.1878 | 0.0000 | 0.0000 |
| F_DER | F_MER | 4.1412 | 0.0000 | 0.0000 |
| S_BUT | F_STN | 4.0940 | 0.0000 | 0.0000 |
| F_NIM | S_FRH | 4.0116 | 0.0001 | 0.0002 |
| W_USR | F_TOU | 3.8267 | 0.0001 | 0.0002 |
| W_USR | F_STN | 3.8196 | 0.0001 | 0.0002 |
| F_MIL | F_BUT | 3.7908 | 0.0002 | 0.0004 |
| F_BUT | F_MER | 3.7600 | 0.0002 | 0.0004 |
| F_DER | F_TOU | 3.6623 | 0.0002 | 0.0004 |
| W_USR | S_FRH | 3.6614 | 0.0003 | 0.0005 |
| F_MRH | F_MER | 3.6446 | 0.0003 | 0.0005 |
| F_MRH | F_STN | 3.5797 | 0.0003 | 0.0005 |
| F_BUT | F_STN | 3.5178 | 0.0004 | 0.0007 |
| S_BUT | F_TOU | 3.4181 | 0.0006 | 0.0010 |
| F_FRH | F_MRH | 3.3930 | 0.0007 | 0.0012 |
| F_BUT | F_MKH | 3.3042 | 0.0010 | 0.0017 |
| F_BUT | F_TOU | 3.1003 | 0.0019 | 0.0032 |
| S_MIL | S_FRH | 3.0946 | 0.0020 | 0.0033 |
| F_DER | S_MIL | 3.0470 | 0.0023 | 0.0037 |
| F_BUT | F_COL | 2.9286 | 0.0034 | 0.0054 |
| F_TOU | S_FRH | 2.9152 | 0.0036 | 0.0057 |
| F_COL | F_STN | 2.8340 | 0.0046 | 0.0072 |
| F_COL | F_TOU | 2.8103 | 0.0049 | 0.0076 |
| S_BUT | F_FRH | 2.7886 | 0.0053 | 0.0081 |
| F_STN | S_FRH | 2.7431 | 0.0061 | 0.0092 |
| F_DER | F_BUT | 2.7099 | 0.0067 | 0.0100 |
| W_USR | F_NIM | 2.7077 | 0.0068 | 0.0101 |
| F_FRH | W_USR | 2.6747 | 0.0075 | 0.0110 |
| F_COL | S_FRH | 2.4104 | 0.0159 | 0.0230 |
| F_NIM | F_MER | 2.3342 | 0.0196 | 0.0281 |
| S_MIL | F_NIM | 2.2807 | 0.0226 | 0.0320 |
| F_FRH | S_MIL | 2.2315 | 0.0256 | 0.0359 |
| F_MKH | S_FRH | 2.1175 | 0.0342 | 0.0475 |
| S_MIL | F_MER | 2.0975 | 0.0360 | 0.0495 |
Table S6: Mean, standard deviation, median, 25th, and 75thquartile of allelic richness across loci for individuals grouped by runtype & tributary.
| pop | mean | sd | median | Q1 | Q3 |
|---|---|---|---|---|---|
| F_COL | 1.51 | 0.48 | 1.45 | 1 | 1.91 |
| F_MIL | 1.52 | 0.48 | 1.47 | 1 | 1.92 |
| F_DER | 1.51 | 0.50 | 1.42 | 1 | 1.93 |
| F_BUT | 1.51 | 0.49 | 1.38 | 1 | 1.93 |
| F_FRH | 1.52 | 0.49 | 1.48 | 1 | 1.93 |
| F_NIM | 1.51 | 0.49 | 1.44 | 1 | 1.91 |
| F_MKH | 1.51 | 0.48 | 1.44 | 1 | 1.91 |
| F_STN | 1.51 | 0.48 | 1.43 | 1 | 1.90 |
| F_TOU | 1.51 | 0.48 | 1.43 | 1 | 1.91 |
| F_MRH | 1.52 | 0.48 | 1.43 | 1 | 1.91 |
| F_MER | 1.51 | 0.48 | 1.42 | 1 | 1.91 |
| L_USR | 1.36 | 0.45 | 1.19 | 1 | 1.80 |
| W_USR | 1.52 | 0.48 | 1.47 | 1 | 1.91 |
| S_MIL | 1.51 | 0.50 | 1.38 | 1 | 1.93 |
| S_DER | 1.48 | 0.50 | 1.46 | 1 | 1.92 |
| S_BUT | 1.51 | 0.55 | 1.71 | 1 | 1.99 |
| S_FRH | 1.50 | 0.49 | 1.42 | 1 | 1.91 |
The Evenness of allelic diversity at a given locus
is calculated as the ratio of the number of abundant genotypes to the
number of rarer genotypes calculated using the ratio of
Stoddart & Tayolor index (diversity index weighted for
more abundant alleles) & Shannon-Wiener index
(diversity index weighted for more rare alleles).
Compare differences in evenness of allelic diversity among tributaries within runs
Test of significant differences among runs using Friedman’s test.
##
## Friedman rank sum test
##
## data: EVENNESS and GRP and LOCUS
## Friedman chi-squared = 791.32, df = 16, p-value < 0.00000000000000022
Test for significant pairwise differences between runs using Wilcoxon signed rank test.
Significance of pairwise comparisons of evenness of allelic diversity per locus for individuals grouped by run type within tributaries using using Wilcoxon signed rank test. Tile color indicates stat (indicates direction of difference); p-values were adjusted for multiple comparisons using Benjamini-Hochberg.
Significance of pairwise comparisons of evenness of allelicdiversity across loci for individuals grouped by run type withintributaries.
| pop1 | pop2 | stat | p.value | p_adj |
|---|---|---|---|---|
| F_FRH | W_USR | 21.10 | 0.0000 | 0.0000 |
| S_FRH | W_USR | 20.80 | 0.0000 | 0.0000 |
| F_MIL | W_USR | 20.36 | 0.0000 | 0.0000 |
| S_MIL | W_USR | 20.31 | 0.0000 | 0.0000 |
| F_COL | W_USR | 20.18 | 0.0000 | 0.0000 |
| S_DER | W_USR | 19.95 | 0.0000 | 0.0000 |
| F_DER | W_USR | 19.24 | 0.0000 | 0.0000 |
| F_MER | W_USR | 19.24 | 0.0000 | 0.0000 |
| F_MRH | W_USR | 19.22 | 0.0000 | 0.0000 |
| F_NIM | W_USR | 19.12 | 0.0000 | 0.0000 |
| F_STN | W_USR | 19.02 | 0.0000 | 0.0000 |
| F_MKH | W_USR | 18.97 | 0.0000 | 0.0000 |
| F_BUT | W_USR | 18.78 | 0.0000 | 0.0000 |
| F_TOU | W_USR | 18.78 | 0.0000 | 0.0000 |
| L_USR | W_USR | 18.59 | 0.0000 | 0.0000 |
| S_BUT | W_USR | 17.83 | 0.0000 | 0.0000 |
| F_FRH | F_TOU | 4.65 | 0.0000 | 0.0000 |
| S_FRH | S_BUT | 4.61 | 0.0000 | 0.0000 |
| F_FRH | S_BUT | 4.55 | 0.0000 | 0.0000 |
| F_FRH | L_USR | 4.34 | 0.0000 | 0.0000 |
| S_FRH | F_TOU | 4.24 | 0.0000 | 0.0000 |
| F_FRH | F_BUT | 4.20 | 0.0000 | 0.0000 |
| S_FRH | L_USR | 4.19 | 0.0000 | 0.0000 |
| F_FRH | F_MKH | 4.04 | 0.0001 | 0.0005 |
| F_FRH | F_NIM | 3.99 | 0.0001 | 0.0005 |
| F_FRH | F_MER | 3.91 | 0.0001 | 0.0005 |
| S_FRH | F_NIM | 3.89 | 0.0001 | 0.0005 |
| S_FRH | F_BUT | 3.89 | 0.0001 | 0.0005 |
| S_FRH | F_MKH | 3.82 | 0.0001 | 0.0005 |
| S_FRH | F_MER | 3.75 | 0.0002 | 0.0009 |
| S_MIL | S_BUT | 3.63 | 0.0003 | 0.0013 |
| F_MIL | S_BUT | 3.55 | 0.0004 | 0.0016 |
| F_MIL | F_BUT | 3.53 | 0.0004 | 0.0016 |
| F_FRH | F_STN | 3.50 | 0.0005 | 0.0019 |
| F_MIL | F_TOU | 3.47 | 0.0005 | 0.0019 |
| S_FRH | F_STN | 3.41 | 0.0007 | 0.0026 |
| S_FRH | F_DER | 3.40 | 0.0007 | 0.0026 |
| S_FRH | S_DER | 3.34 | 0.0008 | 0.0029 |
| F_FRH | F_DER | 3.34 | 0.0009 | 0.0031 |
| S_FRH | F_COL | 3.31 | 0.0009 | 0.0031 |
| F_FRH | F_COL | 3.30 | 0.0010 | 0.0033 |
| F_MIL | L_USR | 3.12 | 0.0018 | 0.0058 |
| F_MIL | F_NIM | 3.10 | 0.0020 | 0.0063 |
| F_FRH | S_DER | 2.96 | 0.0031 | 0.0096 |
| F_MIL | F_MKH | 2.76 | 0.0058 | 0.0175 |
| F_COL | S_BUT | 2.72 | 0.0065 | 0.0192 |
| F_MIL | F_DER | 2.70 | 0.0068 | 0.0197 |
| F_MRH | S_BUT | 2.62 | 0.0088 | 0.0249 |
| F_MIL | F_MER | 2.55 | 0.0107 | 0.0297 |
| S_FRH | F_MRH | 2.52 | 0.0116 | 0.0316 |
| S_DER | S_BUT | 2.51 | 0.0121 | 0.0323 |
| F_FRH | F_MRH | 2.39 | 0.0171 | 0.0447 |
| S_MIL | L_USR | 2.34 | 0.0191 | 0.0490 |
Compare distributions of values.
Distribution of evenness of allelic diversity per locus for individuals grouped by run type within tributaries.
Table S7: Mean, standard deviation, median, 25th, and 75thquartile of evenness of evenness per locus for individuals grouped byrun type & tributary.
| GRP | mean | sd | median | Q1 | Q3 |
|---|---|---|---|---|---|
| F_COL | 0.7645 | 0.1556 | 0.7720 | 0.6506 | 0.8990 |
| F_MIL | 0.7681 | 0.1568 | 0.7747 | 0.6461 | 0.8990 |
| F_DER | 0.7631 | 0.1612 | 0.7720 | 0.6257 | 0.8990 |
| F_BUT | 0.7617 | 0.1602 | 0.7747 | 0.6336 | 0.8990 |
| F_FRH | 0.7697 | 0.1540 | 0.7823 | 0.6478 | 0.8990 |
| F_NIM | 0.7631 | 0.1582 | 0.7720 | 0.6321 | 0.8990 |
| F_MKH | 0.7632 | 0.1580 | 0.7703 | 0.6397 | 0.8990 |
| F_STN | 0.7631 | 0.1596 | 0.7786 | 0.6336 | 0.8990 |
| F_TOU | 0.7620 | 0.1585 | 0.7703 | 0.6353 | 0.8988 |
| F_MRH | 0.7648 | 0.1614 | 0.7720 | 0.6397 | 0.8990 |
| F_MER | 0.7634 | 0.1581 | 0.7720 | 0.6321 | 0.8984 |
| L_USR | 0.7609 | 0.1613 | 0.7769 | 0.6397 | 0.8990 |
| W_USR | 0.6925 | 0.1988 | 0.6853 | 0.5040 | 0.8694 |
| S_MIL | 0.7663 | 0.1615 | 0.7734 | 0.6397 | 0.8990 |
| S_DER | 0.7633 | 0.1600 | 0.7684 | 0.6397 | 0.8990 |
| S_BUT | 0.7578 | 0.1695 | 0.7769 | 0.6223 | 0.9085 |
| S_FRH | 0.7715 | 0.1577 | 0.7534 | 0.6397 | 0.9240 |
\(\theta_{T}\) (Nei 1987), the nucleotide diversity (\(\pi\)) is calculated as the sum of the number of differences between pairs of haplotypes of a given locus, divided by the number of comparisons made. This parameter is biased towards alleles segregating at intermediate rates and will underestimate genetic diversity due to rare alleles.
Create a tidy data set of variant call information (position,
alternate alleles) from filtered VCF file used to create haplotypes.
Create a data frame of haplotype sequences in the data set(s) using
reference contig sequences, VCF information, and codes.out
file from rad_haplotyper.
# Read in the reference sequence
fa <- seqinr::read.fasta("data/REF/CHRIST2018/reference.fasta")
# create tibble with locus name and sequence
ref <- tibble::tibble(name = seqinr::getName(fa), seq = toupper(unlist(seqinr::getSequence(fa, as.string = TRUE))))
# Read the haplotype VCF file and put it in tidy format
vcf_tidy <- read.vcfR("data/POPGEN/ONC.haps.filtered.recode.vcf") %>%
vcfR2tidy()
# Get the data frame with the positions
pos_tbl <- vcf_tidy$fix %>%
filter(ALT %in% c("A", "T", "C", "G")) %>%
unite(LOCUS, CHROM, POS, sep = "_", remove = FALSE)
# Make a data frame with all of the haplotype sequences from the 'codes.out' file from rad_haplotyper
hap_index <- read_tsv("data/HAPLOTYPING/codes.ONC.gen", col_names = FALSE) %>%
filter(X1 %in% locNames(gen_obj)) %>%
split(.$X1) %>%
purrr::map(function(x) {
codes <- str_split(x$X2, ",") %>%
unlist()
code_tbl <- tibble(locus = x$X1, code = codes) %>%
separate(code, c("hap", "code"), sep = ":") %>%
left_join(ref, by = c("locus" = "name"))
pos <- pos_tbl %>%
filter(CHROM == code_tbl$locus[1]) %>%
pull(POS)
for (i in 1:nrow(code_tbl)) {
alleles <- str_split(code_tbl$hap[i], "") %>%
unlist()
replace <- tibble(pos = pos, allele = alleles)
for (j in 1:nrow(replace)) {
str_sub(code_tbl$seq[i], replace$pos[j], 1) <- replace$allele[j]
}
}
code_tbl
}) %>%
bind_rows()
Calculate nucleotide diversity for individuals grouped by run type within each tributary.
# set pops
setPop(gen) <- ~RUN_LOC
# create tidy data set
gen_tidy <- gen %>%
genind2df(oneColPerAll = TRUE) %>%
rownames_to_column(var = "ind") %>%
gather(allele, code, -pop, -ind) %>%
separate(allele, c("locus", "allele"), sep = "\\.") %>%
filter(pop %in% trib_runs) %>%
droplevels()
# Calculate haplotype related stats
hap_div_tbl <- locNames(gen) %>%
purrr::map(function(y) {
# for each locus
gen_tidy %>%
filter(locus == y) %>%
filter(!code == "NA") %>%
left_join(hap_index, by = c("code", "locus")) %>%
as.tibble() %>%
arrange(factor(pop, levels = trib_runs)) %>%
# for each population per locus
split(.$pop) %>%
purrr::map(function(x) {
y <- do.call(rbind, strsplit(x$seq, ""))
rownames(y) <- paste(x$ind, x$allele, sep = ".")
# create DNAbin object for locus
dna_bin <- as.DNAbin(y)
# calculate Tajima's D and p-values
taj_test <- tajima.test(dna_bin)
# create table with results
tibble(locus = x$locus[1],
pop = x$pop[1],
nuc_div = nuc.div(dna_bin)
}) %>%
bind_rows()
}) %>%
bind_rows()
write_delim(hap_div_tbl, "results/trib_run.hapdiv", delim = "\t")
Test of significant differences among runs using Friedman’s test. To get an unreplicated complete block design only loci variable in all locations were used for test of heterogeneity among groups.
##
## Friedman rank sum test
##
## data: nuc_div and pop and locus
## Friedman chi-squared = 3466.1, df = 16, p-value < 0.00000000000000022
Test for significance of pairwise comparisons using Wilcoxon signed rank test.
Pairwise comparison of levels of significant differences among individuals grouped by tributary and run type.
Compare distribution of values.
Distribution of nucleotide diversity across all loci for individuals grouped by tributary and runs
Significance of pairwise comparisons of nucleotide diversityamong runs within tributaries.
| pop1 | pop2 | stat | p.value | p_adj |
|---|---|---|---|---|
| F_FRH | W_USR | 39.1630 | 0.0000 | 0.0000 |
| S_DER | W_USR | 38.7344 | 0.0000 | 0.0000 |
| F_COL | W_USR | 38.0562 | 0.0000 | 0.0000 |
| F_MIL | W_USR | 36.2643 | 0.0000 | 0.0000 |
| F_MER | W_USR | 35.8979 | 0.0000 | 0.0000 |
| F_NIM | W_USR | 35.4991 | 0.0000 | 0.0000 |
| F_TOU | W_USR | 35.3306 | 0.0000 | 0.0000 |
| F_MKH | W_USR | 34.9710 | 0.0000 | 0.0000 |
| F_STN | W_USR | 34.8903 | 0.0000 | 0.0000 |
| L_USR | W_USR | 34.3190 | 0.0000 | 0.0000 |
| F_BUT | W_USR | 33.7188 | 0.0000 | 0.0000 |
| S_MIL | W_USR | 33.4761 | 0.0000 | 0.0000 |
| F_DER | W_USR | 33.3037 | 0.0000 | 0.0000 |
| F_MRH | W_USR | 32.6041 | 0.0000 | 0.0000 |
| S_BUT | W_USR | 29.3361 | 0.0000 | 0.0000 |
| S_FRH | W_USR | 25.7698 | 0.0000 | 0.0000 |
| F_FRH | S_BUT | 10.6605 | 0.0000 | 0.0000 |
| F_MIL | S_BUT | 9.5402 | 0.0000 | 0.0000 |
| F_COL | S_BUT | 9.3455 | 0.0000 | 0.0000 |
| S_DER | S_BUT | 9.2374 | 0.0000 | 0.0000 |
| F_FRH | F_MER | 8.4649 | 0.0000 | 0.0000 |
| F_FRH | F_MKH | 8.0066 | 0.0000 | 0.0000 |
| F_FRH | F_NIM | 7.8470 | 0.0000 | 0.0000 |
| F_MIL | F_BUT | 7.8298 | 0.0000 | 0.0000 |
| F_FRH | F_BUT | 7.8248 | 0.0000 | 0.0000 |
| S_MIL | S_BUT | 7.5732 | 0.0000 | 0.0000 |
| F_FRH | L_USR | 7.5485 | 0.0000 | 0.0000 |
| F_FRH | F_TOU | 7.3375 | 0.0000 | 0.0000 |
| F_NIM | S_BUT | 7.2899 | 0.0000 | 0.0000 |
| F_MIL | L_USR | 7.1077 | 0.0000 | 0.0000 |
| F_MER | S_BUT | 7.0675 | 0.0000 | 0.0000 |
| F_MIL | F_STN | 6.9027 | 0.0000 | 0.0000 |
| F_MKH | S_BUT | 6.7397 | 0.0000 | 0.0000 |
| F_FRH | F_DER | 6.4714 | 0.0000 | 0.0000 |
| F_MRH | S_BUT | 6.4262 | 0.0000 | 0.0000 |
| F_TOU | S_BUT | 6.4188 | 0.0000 | 0.0000 |
| F_FRH | S_FRH | 6.3862 | 0.0000 | 0.0000 |
| F_DER | S_BUT | 6.1653 | 0.0000 | 0.0000 |
| F_FRH | F_STN | 6.1338 | 0.0000 | 0.0000 |
| F_COL | L_USR | 5.6465 | 0.0000 | 0.0000 |
| F_FRH | S_MIL | 5.6304 | 0.0000 | 0.0000 |
| F_FRH | F_COL | 5.0746 | 0.0000 | 0.0000 |
| F_MIL | F_TOU | 4.9916 | 0.0000 | 0.0000 |
| F_COL | F_BUT | 4.9583 | 0.0000 | 0.0000 |
| F_FRH | F_MRH | 4.4841 | 0.0000 | 0.0000 |
| F_MIL | F_MKH | 4.4521 | 0.0000 | 0.0000 |
| F_COL | S_FRH | 4.4422 | 0.0000 | 0.0000 |
| S_DER | S_FRH | 4.3559 | 0.0000 | 0.0000 |
| F_MIL | S_FRH | 4.3219 | 0.0000 | 0.0000 |
| F_MIL | F_MER | 4.2738 | 0.0000 | 0.0000 |
| F_STN | S_BUT | 4.2074 | 0.0000 | 0.0000 |
| F_BUT | S_BUT | 4.1849 | 0.0000 | 0.0000 |
| S_DER | F_TOU | 3.9544 | 0.0001 | 0.0003 |
| F_MIL | F_NIM | 3.8064 | 0.0001 | 0.0003 |
| F_COL | F_STN | 3.4889 | 0.0005 | 0.0012 |
| F_MRH | F_BUT | 3.4750 | 0.0005 | 0.0012 |
| S_DER | F_BUT | 3.3535 | 0.0008 | 0.0019 |
| S_DER | F_MKH | 3.2866 | 0.0010 | 0.0023 |
| F_DER | F_BUT | 3.2501 | 0.0012 | 0.0027 |
| F_DER | L_USR | 3.2285 | 0.0012 | 0.0027 |
| L_USR | S_BUT | 3.1631 | 0.0016 | 0.0035 |
| S_DER | F_MER | 3.1512 | 0.0016 | 0.0035 |
| F_NIM | F_BUT | 2.8420 | 0.0045 | 0.0097 |
| S_DER | F_NIM | 2.8292 | 0.0047 | 0.0100 |
| S_DER | L_USR | 2.7652 | 0.0057 | 0.0119 |
| F_NIM | S_FRH | 2.7592 | 0.0058 | 0.0120 |
| F_MIL | F_DER | 2.7509 | 0.0059 | 0.0120 |
| F_NIM | L_USR | 2.7437 | 0.0061 | 0.0122 |
| F_MKH | L_USR | 2.7356 | 0.0062 | 0.0122 |
| F_MRH | F_DER | 2.6655 | 0.0077 | 0.0150 |
| F_MER | F_BUT | 2.6482 | 0.0081 | 0.0155 |
| F_MKH | F_MER | 2.5498 | 0.0108 | 0.0203 |
| F_COL | F_MER | 2.5450 | 0.0109 | 0.0203 |
| S_DER | F_DER | 2.4792 | 0.0132 | 0.0243 |
| F_MRH | L_USR | 2.4721 | 0.0134 | 0.0243 |
| F_MKH | F_NIM | 2.4247 | 0.0153 | 0.0274 |
| F_MER | L_USR | 2.3802 | 0.0173 | 0.0306 |
| S_DER | S_MIL | 2.3737 | 0.0176 | 0.0307 |
| F_TOU | S_FRH | 2.3173 | 0.0205 | 0.0353 |
| S_FRH | S_BUT | 2.3070 | 0.0211 | 0.0359 |
| S_DER | F_STN | 2.3011 | 0.0214 | 0.0359 |
| F_COL | S_MIL | 2.2854 | 0.0223 | 0.0370 |
| F_MKH | F_BUT | 2.2459 | 0.0247 | 0.0403 |
| F_MER | S_FRH | 2.2424 | 0.0249 | 0.0403 |
| S_MIL | L_USR | 2.2186 | 0.0265 | 0.0424 |
| S_DER | F_MRH | 2.1733 | 0.0298 | 0.0471 |
| F_MKH | S_FRH | 2.1338 | 0.0329 | 0.0514 |
| F_BUT | L_USR | 2.1021 | 0.0355 | 0.0549 |
| S_MIL | F_BUT | 2.0595 | 0.0394 | 0.0602 |
| F_TOU | F_BUT | 1.9216 | 0.0547 | 0.0827 |
| F_TOU | L_USR | 1.8511 | 0.0642 | 0.0959 |
| F_MIL | F_MRH | 1.7711 | 0.0765 | 0.1131 |
| F_MIL | S_MIL | 1.7501 | 0.0801 | 0.1160 |
| S_MIL | F_STN | 1.7494 | 0.0802 | 0.1160 |
| F_MRH | S_MIL | 1.7340 | 0.0829 | 0.1187 |
| F_NIM | F_STN | 1.6828 | 0.0924 | 0.1308 |
| F_MRH | F_STN | 1.6749 | 0.0940 | 0.1308 |
| F_FRH | S_DER | 1.6699 | 0.0949 | 0.1308 |
| F_STN | L_USR | 1.6688 | 0.0952 | 0.1308 |
| F_MIL | F_COL | 1.6428 | 0.1004 | 0.1365 |
| F_COL | F_TOU | 1.5766 | 0.1149 | 0.1547 |
| F_MIL | S_DER | 1.5062 | 0.1320 | 0.1760 |
| F_NIM | F_MER | 1.4076 | 0.1593 | 0.2103 |
| F_MRH | S_FRH | 1.3832 | 0.1666 | 0.2179 |
| F_MER | F_STN | 1.3571 | 0.1748 | 0.2264 |
| F_MRH | F_MER | 1.3084 | 0.1907 | 0.2434 |
| F_COL | F_MKH | 1.3062 | 0.1915 | 0.2434 |
| F_MKH | F_DER | 1.2650 | 0.2059 | 0.2593 |
| F_TOU | F_MER | 1.2137 | 0.2249 | 0.2806 |
| F_COL | F_DER | 1.1711 | 0.2415 | 0.2986 |
| F_MRH | F_TOU | 1.1316 | 0.2578 | 0.3159 |
| F_FRH | F_MIL | 1.1019 | 0.2705 | 0.3285 |
| F_MKH | S_MIL | 1.0733 | 0.2832 | 0.3408 |
| F_COL | F_NIM | 1.0207 | 0.3074 | 0.3667 |
| F_BUT | F_STN | 1.0124 | 0.3113 | 0.3681 |
| F_NIM | S_MIL | 0.9788 | 0.3277 | 0.3842 |
| F_MRH | F_MKH | 0.9442 | 0.3451 | 0.4011 |
| F_DER | F_STN | 0.8775 | 0.3802 | 0.4382 |
| F_TOU | F_NIM | 0.8558 | 0.3921 | 0.4481 |
| F_COL | F_MRH | 0.8145 | 0.4154 | 0.4708 |
| F_BUT | S_FRH | 0.7313 | 0.4646 | 0.5222 |
| F_NIM | F_DER | 0.7121 | 0.4764 | 0.5311 |
| F_STN | S_FRH | 0.6743 | 0.5001 | 0.5530 |
| S_MIL | S_FRH | 0.5758 | 0.5648 | 0.6158 |
| F_MRH | F_NIM | 0.5739 | 0.5660 | 0.6158 |
| S_DER | F_COL | 0.5644 | 0.5725 | 0.6179 |
| F_MKH | F_TOU | 0.4981 | 0.6184 | 0.6622 |
| F_TOU | S_MIL | 0.4698 | 0.6385 | 0.6784 |
| S_FRH | L_USR | 0.4294 | 0.6676 | 0.7038 |
| F_TOU | F_DER | 0.3588 | 0.7197 | 0.7529 |
| F_MER | S_MIL | 0.3470 | 0.7286 | 0.7564 |
| F_DER | S_MIL | 0.3213 | 0.7479 | 0.7706 |
| F_MKH | F_STN | 0.3003 | 0.7639 | 0.7811 |
| F_STN | F_TOU | 0.2000 | 0.8415 | 0.8541 |
| F_DER | S_FRH | 0.1108 | 0.9118 | 0.9186 |
| F_MER | F_DER | 0.0505 | 0.9597 | 0.9597 |
Table S8: Comparison of mean +/- standard deviation, minimum,and maximum values of the nucleotide diversity (pi) across loci amongindividuals grouped by run within tributary.
| pop | mean | std | max | min |
|---|---|---|---|---|
| F_COL | 0.001045 | 0.001275 | 0.014935 | 0 |
| F_MIL | 0.001056 | 0.001298 | 0.015935 | 0 |
| F_DER | 0.001038 | 0.001294 | 0.014056 | 0 |
| F_BUT | 0.001026 | 0.001292 | 0.014613 | 0 |
| F_FRH | 0.001058 | 0.001272 | 0.013546 | 0 |
| F_NIM | 0.001034 | 0.001271 | 0.014977 | 0 |
| F_MKH | 0.001035 | 0.001276 | 0.015955 | 0 |
| F_STN | 0.001028 | 0.001281 | 0.014589 | 0 |
| F_TOU | 0.001033 | 0.001275 | 0.014795 | 0 |
| F_MRH | 0.001042 | 0.001311 | 0.015478 | 0 |
| F_MER | 0.001035 | 0.001274 | 0.014661 | 0 |
| L_USR | 0.001029 | 0.001289 | 0.014423 | 0 |
| W_USR | 0.000790 | 0.001212 | 0.013255 | 0 |
| S_MIL | 0.001043 | 0.001315 | 0.016022 | 0 |
| S_DER | 0.001046 | 0.001275 | 0.015410 | 0 |
| S_BUT | 0.001012 | 0.001308 | 0.015957 | 0 |
| S_FRH | 0.001048 | 0.001392 | 0.014501 | 0 |
\(\theta\) is the population-scaled mutation rate and can be estimated as \(\hat{\theta}_T\) (Nei 1987), the number of pairwise differences (nucleotide diversity, \(\pi\)), while \(\hat{\theta}_W\) is measured as the number of segregating sites (Watterson 1975).
Because \(\hat{\theta}_T\) will underestimate the number of mutations that are rare in the population, Tajima’s \(D\) can be used to test the neutral mutation hypothesis:
Test of significant differences among runs using Friedman’s test. To get an unreplicated complete block design only loci variable (Tajima’s D is able to be calculated) in all locations were used for test of heterogeneity among groups.
##
## Friedman rank sum test
##
## data: tajima_d and pop and locus
## Friedman chi-squared = 3033.8, df = 16, p-value < 0.00000000000000022
Test for significance of pairwise comparisons using Wilcoxon signed rank test.
Pairwise comparison of levels of significant differences among individuals grouped by tributary and run type.
Compare distribution of values.
Distribution of nucleotide diversity across all loci for individuals grouped by tributary and runs
Significance of pairwise comparisons of nucleotide diversityamong runs within tributaries.
| pop1 | pop2 | stat | p.value | p_adj |
|---|---|---|---|---|
| F_FRH | S_FRH | 26.5849 | 0.0000 | 0.0000 |
| F_MER | S_FRH | 26.2976 | 0.0000 | 0.0000 |
| F_COL | S_FRH | 25.5437 | 0.0000 | 0.0000 |
| F_NIM | S_FRH | 24.8900 | 0.0000 | 0.0000 |
| F_MKH | S_FRH | 24.0444 | 0.0000 | 0.0000 |
| F_TOU | S_FRH | 23.8379 | 0.0000 | 0.0000 |
| S_DER | S_FRH | 21.3976 | 0.0000 | 0.0000 |
| F_STN | S_FRH | 19.9711 | 0.0000 | 0.0000 |
| F_COL | W_USR | 19.7533 | 0.0000 | 0.0000 |
| F_MER | W_USR | 19.2801 | 0.0000 | 0.0000 |
| F_MER | F_DER | 19.2761 | 0.0000 | 0.0000 |
| F_MER | F_MRH | 19.2191 | 0.0000 | 0.0000 |
| F_FRH | W_USR | 19.1950 | 0.0000 | 0.0000 |
| F_NIM | W_USR | 18.7630 | 0.0000 | 0.0000 |
| L_USR | S_FRH | 18.6359 | 0.0000 | 0.0000 |
| F_COL | F_DER | 18.4471 | 0.0000 | 0.0000 |
| F_FRH | F_MRH | 18.4293 | 0.0000 | 0.0000 |
| F_COL | F_MRH | 18.3611 | 0.0000 | 0.0000 |
| F_NIM | F_MRH | 18.1141 | 0.0000 | 0.0000 |
| S_DER | W_USR | 17.9094 | 0.0000 | 0.0000 |
| F_NIM | F_DER | 17.6786 | 0.0000 | 0.0000 |
| F_FRH | F_DER | 17.6316 | 0.0000 | 0.0000 |
| F_BUT | S_FRH | 17.6218 | 0.0000 | 0.0000 |
| F_MKH | W_USR | 17.5967 | 0.0000 | 0.0000 |
| F_TOU | W_USR | 17.5878 | 0.0000 | 0.0000 |
| F_MIL | S_FRH | 17.2354 | 0.0000 | 0.0000 |
| F_MKH | F_MRH | 16.5180 | 0.0000 | 0.0000 |
| F_MKH | F_DER | 16.3529 | 0.0000 | 0.0000 |
| F_TOU | F_MRH | 16.0814 | 0.0000 | 0.0000 |
| F_TOU | F_DER | 15.2753 | 0.0000 | 0.0000 |
| F_STN | W_USR | 14.9442 | 0.0000 | 0.0000 |
| S_BUT | S_FRH | 14.7833 | 0.0000 | 0.0000 |
| L_USR | W_USR | 14.6604 | 0.0000 | 0.0000 |
| S_MIL | S_FRH | 14.0956 | 0.0000 | 0.0000 |
| F_MER | F_BUT | 13.6490 | 0.0000 | 0.0000 |
| F_BUT | W_USR | 13.4927 | 0.0000 | 0.0000 |
| F_COL | F_MIL | 13.4734 | 0.0000 | 0.0000 |
| F_DER | S_FRH | 13.3879 | 0.0000 | 0.0000 |
| F_MIL | W_USR | 13.2877 | 0.0000 | 0.0000 |
| F_MER | F_MIL | 13.0190 | 0.0000 | 0.0000 |
| S_BUT | W_USR | 12.9179 | 0.0000 | 0.0000 |
| F_COL | S_MIL | 12.6701 | 0.0000 | 0.0000 |
| F_NIM | F_BUT | 12.5090 | 0.0000 | 0.0000 |
| F_MRH | S_FRH | 12.4327 | 0.0000 | 0.0000 |
| F_FRH | F_MIL | 12.3683 | 0.0000 | 0.0000 |
| F_FRH | S_MIL | 12.3372 | 0.0000 | 0.0000 |
| F_COL | F_BUT | 12.3272 | 0.0000 | 0.0000 |
| F_MER | S_MIL | 12.2343 | 0.0000 | 0.0000 |
| F_NIM | F_MIL | 12.1236 | 0.0000 | 0.0000 |
| F_FRH | F_BUT | 12.0649 | 0.0000 | 0.0000 |
| S_MIL | W_USR | 11.7525 | 0.0000 | 0.0000 |
| F_NIM | S_MIL | 11.6273 | 0.0000 | 0.0000 |
| S_DER | F_MRH | 11.5480 | 0.0000 | 0.0000 |
| S_DER | S_MIL | 11.3109 | 0.0000 | 0.0000 |
| S_DER | F_DER | 11.2090 | 0.0000 | 0.0000 |
| F_STN | F_MRH | 10.7660 | 0.0000 | 0.0000 |
| F_DER | W_USR | 10.3608 | 0.0000 | 0.0000 |
| F_MER | F_STN | 10.3304 | 0.0000 | 0.0000 |
| F_MER | L_USR | 10.2101 | 0.0000 | 0.0000 |
| F_MKH | F_BUT | 10.1922 | 0.0000 | 0.0000 |
| F_MKH | F_MIL | 9.9919 | 0.0000 | 0.0000 |
| F_COL | S_BUT | 9.9542 | 0.0000 | 0.0000 |
| F_NIM | F_STN | 9.8380 | 0.0000 | 0.0000 |
| F_COL | L_USR | 9.8129 | 0.0000 | 0.0000 |
| F_STN | F_DER | 9.6685 | 0.0000 | 0.0000 |
| F_MKH | S_MIL | 9.6327 | 0.0000 | 0.0000 |
| F_MER | S_BUT | 9.5989 | 0.0000 | 0.0000 |
| F_MRH | W_USR | 9.5451 | 0.0000 | 0.0000 |
| F_TOU | F_MIL | 9.4676 | 0.0000 | 0.0000 |
| F_COL | F_STN | 9.4089 | 0.0000 | 0.0000 |
| F_FRH | S_BUT | 9.4079 | 0.0000 | 0.0000 |
| F_TOU | F_BUT | 9.3481 | 0.0000 | 0.0000 |
| F_TOU | S_MIL | 9.3395 | 0.0000 | 0.0000 |
| L_USR | F_MRH | 9.1500 | 0.0000 | 0.0000 |
| F_FRH | L_USR | 9.0957 | 0.0000 | 0.0000 |
| F_FRH | F_STN | 8.8313 | 0.0000 | 0.0000 |
| F_NIM | S_BUT | 8.7719 | 0.0000 | 0.0000 |
| F_NIM | L_USR | 8.7611 | 0.0000 | 0.0000 |
| L_USR | F_DER | 8.1895 | 0.0000 | 0.0000 |
| S_DER | S_BUT | 7.8476 | 0.0000 | 0.0000 |
| F_BUT | F_MRH | 7.3951 | 0.0000 | 0.0000 |
| F_TOU | S_BUT | 7.3056 | 0.0000 | 0.0000 |
| F_MKH | S_BUT | 7.0906 | 0.0000 | 0.0000 |
| F_MKH | F_STN | 6.8108 | 0.0000 | 0.0000 |
| F_BUT | F_DER | 6.7771 | 0.0000 | 0.0000 |
| S_DER | F_MIL | 6.7527 | 0.0000 | 0.0000 |
| F_TOU | L_USR | 6.3255 | 0.0000 | 0.0000 |
| F_MIL | F_MRH | 6.3219 | 0.0000 | 0.0000 |
| F_MKH | L_USR | 6.2571 | 0.0000 | 0.0000 |
| S_DER | F_BUT | 6.1590 | 0.0000 | 0.0000 |
| F_TOU | F_STN | 5.8991 | 0.0000 | 0.0000 |
| F_MIL | F_DER | 5.8486 | 0.0000 | 0.0000 |
| F_STN | S_MIL | 5.2547 | 0.0000 | 0.0000 |
| S_BUT | F_MRH | 4.5124 | 0.0000 | 0.0000 |
| L_USR | S_MIL | 4.4513 | 0.0000 | 0.0000 |
| F_MER | F_TOU | 4.4447 | 0.0000 | 0.0000 |
| F_MER | F_MKH | 4.2973 | 0.0000 | 0.0000 |
| S_DER | L_USR | 4.2409 | 0.0000 | 0.0000 |
| S_BUT | F_DER | 3.8866 | 0.0001 | 0.0001 |
| F_COL | S_DER | 3.8819 | 0.0001 | 0.0001 |
| F_STN | F_MIL | 3.8198 | 0.0001 | 0.0001 |
| F_COL | F_MKH | 3.8111 | 0.0001 | 0.0001 |
| F_COL | F_TOU | 3.7087 | 0.0002 | 0.0003 |
| S_DER | F_STN | 3.6931 | 0.0002 | 0.0003 |
| F_MER | S_DER | 3.6497 | 0.0003 | 0.0004 |
| F_FRH | S_DER | 3.3763 | 0.0007 | 0.0009 |
| F_STN | F_BUT | 3.2806 | 0.0010 | 0.0013 |
| F_NIM | F_TOU | 3.2762 | 0.0011 | 0.0014 |
| F_STN | S_BUT | 3.2630 | 0.0011 | 0.0014 |
| F_FRH | F_TOU | 3.2079 | 0.0013 | 0.0016 |
| L_USR | F_MIL | 3.1097 | 0.0019 | 0.0023 |
| F_NIM | S_DER | 3.0337 | 0.0024 | 0.0029 |
| F_BUT | S_MIL | 3.0331 | 0.0024 | 0.0029 |
| F_FRH | F_MKH | 2.9880 | 0.0028 | 0.0033 |
| F_NIM | F_MKH | 2.8695 | 0.0041 | 0.0048 |
| S_MIL | F_MRH | 2.7808 | 0.0054 | 0.0063 |
| L_USR | S_BUT | 2.4143 | 0.0158 | 0.0184 |
| F_MIL | S_MIL | 2.3319 | 0.0197 | 0.0227 |
| S_BUT | S_MIL | 1.8825 | 0.0598 | 0.0683 |
| L_USR | F_BUT | 1.8600 | 0.0629 | 0.0713 |
| S_MIL | F_DER | 1.6147 | 0.1064 | 0.1196 |
| F_STN | L_USR | 1.0432 | 0.2968 | 0.3309 |
| F_MER | F_NIM | 0.9721 | 0.3310 | 0.3660 |
| F_COL | F_NIM | 0.9512 | 0.3415 | 0.3745 |
| F_MKH | S_DER | 0.8753 | 0.3814 | 0.4150 |
| F_DER | F_MRH | 0.8548 | 0.3927 | 0.4239 |
| F_BUT | S_BUT | 0.7211 | 0.4708 | 0.5042 |
| F_COL | F_FRH | 0.7129 | 0.4759 | 0.5056 |
| F_TOU | S_DER | 0.6973 | 0.4856 | 0.5120 |
| F_MER | F_FRH | 0.5523 | 0.5808 | 0.6076 |
| S_FRH | W_USR | 0.3931 | 0.6943 | 0.7208 |
| F_MKH | F_TOU | 0.3647 | 0.7153 | 0.7370 |
| F_MIL | S_BUT | 0.3156 | 0.7523 | 0.7693 |
| F_FRH | F_NIM | 0.2919 | 0.7703 | 0.7818 |
| F_MER | F_COL | 0.2378 | 0.8121 | 0.8181 |
| F_BUT | F_MIL | 0.1100 | 0.9124 | 0.9124 |
Table S9: Comparison of mean +/- standard deviation, minimum,and maximum values of the Tajima’s D across loci among individualsgrouped by run within tributary.
| pop | mean | std | max | min |
|---|---|---|---|---|
| W_USR | 0.094266 | 1.035023 | 2.669786 | -1.858645 |
| S_BUT | 0.049145 | 0.954543 | 2.709811 | -2.180529 |
| L_USR | -0.046518 | 0.956674 | 2.592781 | -1.876505 |
| S_MIL | -0.050683 | 0.963661 | 2.447507 | -1.889476 |
| F_MKH | -0.058910 | 0.953114 | 2.587745 | -1.763630 |
| F_NIM | -0.059634 | 0.951602 | 2.662807 | -1.847405 |
| S_DER | -0.062072 | 0.945730 | 2.648186 | -1.866038 |
| F_MER | -0.063557 | 0.953941 | 2.712666 | -1.975765 |
| F_TOU | -0.068298 | 0.951630 | 2.704980 | -1.854902 |
| F_COL | -0.071360 | 0.942641 | 2.673614 | -1.847405 |
| F_FRH | -0.076019 | 0.946088 | 2.648186 | -1.764299 |
| F_STN | -0.080884 | 0.962657 | 2.556259 | -2.001031 |
| S_FRH | -0.087391 | 0.977333 | 2.145134 | -1.797595 |
| F_MRH | -0.088288 | 0.959500 | 2.483303 | -2.005646 |
| F_MIL | -0.093226 | 0.953617 | 2.386602 | -1.900752 |
| F_BUT | -0.095481 | 0.956493 | 2.495917 | -1.879671 |
| F_DER | -0.107183 | 0.962557 | 2.418479 | -1.733626 |
Without detailed demographic information, it can be difficult to distinguish between an effect of selection, population expansion/decline and drift compared to demographics. Calculating a conventional p-value is problematic because you cannot describe the distribution of Tajimas \(D\) independent of true \(\theta\) value which is unknown.
Originally, Tajima proposed comparing the distribution to a beta
distribution, however this has been shown to be too conservative.
Instead, to identify whether the observed genome-wide distribution of
Tajima’s \(D\) reflects the expected
distribution under the neutral mutation hypothesis, we will generate a
genome-wide null distribution of Tajima’s \(D\) for a set of neutral loci reflecting
the composition of the haplotyped empirical data set consisting of the
same number of loci with same distribution of segregating sites by
simulating loci under coalescence using MS (Hudson 2002) to
compare to the empirical data set.
The maximum number of segregating sites per locus in the data set is 7, there are 30) F_COL run individuals in the data set.
Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 30 individuals (60 observations).
# 2*number of individuals in data set
N=60
# number of loci to simulate
REP=1000
# simulate neutral loci
for i in {1..7}
do
scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind60_Sites${i}_neut_Rep1000.stats
done
The sample_stat script distributed along side
MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.
Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.
# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind60")
# create an empty list that will serve as a container to receive the incoming files
sim <-list()
# create a loop to read in your data
for (i in 1:length(filenames)){
sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
col_names = c("t1", "pi",
"t2", "seg_sites",
"t3", "tajima_d",
"t4", "thetaH",
"t5", "H")) %>%
select(pi, seg_sites, tajima_d, thetaH, H)
}
# add names
names(sim) <- filenames
# combine into dataframe
sim <- ldply(sim, data.frame) %>%
rename(simulation = `.id`) %>%
separate(simulation, into = c("temp", "t1"), sep = -6) %>%
separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
select(-t1, -t2)
reps <- 100
# group to be compared
grp <- "F_COL"
# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites) %>%
filter(!seg_sites == 0) %>%
filter(pop == grp) %>%
count(seg_sites)
# create null distributions
null_distributions <- list()
for (s in emp$seg_sites) {
# determine number of loci with s segregating sites
t <- emp %>%
filter(seg_sites == s)
n_loci <- t$n
# sample n_loci simulated loci (with replacement)
null_distributions[[s]] <- sim %>%
filter(seg_sites == s) %>%
sample_n(size = n_loci, replace = TRUE)
}
null_distributions <- ldply(null_distributions, data.frame) %>%
mutate(pop = grp)
taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites, tajima_d) %>%
filter(!seg_sites == 0,
pop == grp) %>%
bind_rows(null_distributions) %>%
mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
select(-N_IND, -N_REPS, -pi, -thetaH, -H)
taj_all <- list()
taj_all[[grp]] <- taj
Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).
Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x and y
## D = 0.15772, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
The maximum number of segregating sites per locus in the data set is 7, there are 20) F_MIL run individuals in the data set.
Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 20 individuals (40 observations).
# 2*number of individuals in data set
N=40
# number of loci to simulate
REP=1000
# simulate neutral loci
for i in {1..7}
do
scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind40_Sites${i}_neut_Rep1000.stats
done
The sample_stat script distributed along side
MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.
Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.
# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind40")
# create an empty list that will serve as a container to receive the incoming files
sim <-list()
# create a loop to read in your data
for (i in 1:length(filenames)){
sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
col_names = c("t1", "pi",
"t2", "seg_sites",
"t3", "tajima_d",
"t4", "thetaH",
"t5", "H")) %>%
select(pi, seg_sites, tajima_d, thetaH, H)
}
# add names
names(sim) <- filenames
# combine into dataframe
sim <- ldply(sim, data.frame) %>%
rename(simulation = `.id`) %>%
separate(simulation, into = c("temp", "t1"), sep = -6) %>%
separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
select(-t1, -t2)
reps <- 100
# group to be compared
grp <- "F_MIL"
# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites) %>%
filter(!seg_sites == 0) %>%
filter(pop == grp) %>%
count(seg_sites)
# create null distributions
null_distributions <- list()
for (s in emp$seg_sites) {
# determine number of loci with s segregating sites
t <- emp %>%
filter(seg_sites == s)
n_loci <- t$n
# sample n_loci simulated loci (with replacement)
null_distributions[[s]] <- sim %>%
filter(seg_sites == s) %>%
sample_n(size = n_loci, replace = TRUE)
}
null_distributions <- ldply(null_distributions, data.frame) %>%
mutate(pop = grp)
taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites, tajima_d) %>%
filter(!seg_sites == 0,
pop == grp) %>%
bind_rows(null_distributions) %>%
mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
select(-N_IND, -N_REPS, -pi, -thetaH, -H)
taj_all[[grp]] <- taj
Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).
Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x and y
## D = 0.21342, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
The maximum number of segregating sites per locus in the data set is 7, there are 15) F_DER run individuals in the data set.
Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 15 individuals (30 observations).
# 2*number of individuals in data set
N=30
# number of loci to simulate
REP=1000
# simulate neutral loci
for i in {1..7}
do
scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind30_Sites${i}_neut_Rep1000.stats
done
The sample_stat script distributed along side
MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.
Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.
# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind30")
# create an empty list that will serve as a container to receive the incoming files
sim <-list()
# create a loop to read in your data
for (i in 1:length(filenames)){
sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
col_names = c("t1", "pi",
"t2", "seg_sites",
"t3", "tajima_d",
"t4", "thetaH",
"t5", "H")) %>%
select(pi, seg_sites, tajima_d, thetaH, H)
}
# add names
names(sim) <- filenames
# combine into dataframe
sim <- ldply(sim, data.frame) %>%
rename(simulation = `.id`) %>%
separate(simulation, into = c("temp", "t1"), sep = -6) %>%
separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
select(-t1, -t2)
reps <- 100
# group to be compared
grp <- "F_DER"
# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites) %>%
filter(!seg_sites == 0) %>%
filter(pop == grp) %>%
count(seg_sites)
# create null distributions
null_distributions <- list()
for (s in emp$seg_sites) {
# determine number of loci with s segregating sites
t <- emp %>%
filter(seg_sites == s)
n_loci <- t$n
# sample n_loci simulated loci (with replacement)
null_distributions[[s]] <- sim %>%
filter(seg_sites == s) %>%
sample_n(size = n_loci, replace = TRUE)
}
null_distributions <- ldply(null_distributions, data.frame) %>%
mutate(pop = grp)
taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites, tajima_d) %>%
filter(!seg_sites == 0,
pop == grp) %>%
bind_rows(null_distributions) %>%
mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
select(-N_IND, -N_REPS, -pi, -thetaH, -H)
taj_all[[grp]] <- taj
Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).
Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x and y
## D = 0.12052, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
The maximum number of segregating sites per locus in the data set is 7, there are 21) F_BUT run individuals in the data set.
Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 21 individuals (42 observations).
# 2*number of individuals in data set
N=42
# number of loci to simulate
REP=1000
# simulate neutral loci
for i in {1..7}
do
scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind42_Sites${i}_neut_Rep1000.stats
done
The sample_stat script distributed along side
MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.
Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.
# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind60")
# create an empty list that will serve as a container to receive the incoming files
sim <-list()
# create a loop to read in your data
for (i in 1:length(filenames)){
sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
col_names = c("t1", "pi",
"t2", "seg_sites",
"t3", "tajima_d",
"t4", "thetaH",
"t5", "H")) %>%
select(pi, seg_sites, tajima_d, thetaH, H)
}
# add names
names(sim) <- filenames
# combine into dataframe
sim <- ldply(sim, data.frame) %>%
rename(simulation = `.id`) %>%
separate(simulation, into = c("temp", "t1"), sep = -6) %>%
separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
select(-t1, -t2)
reps <- 100
# group to be compared
grp <- "F_BUT"
# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites) %>%
filter(!seg_sites == 0) %>%
filter(pop == grp) %>%
count(seg_sites)
# create null distributions
null_distributions <- list()
for (s in emp$seg_sites) {
# determine number of loci with s segregating sites
t <- emp %>%
filter(seg_sites == s)
n_loci <- t$n
# sample n_loci simulated loci (with replacement)
null_distributions[[s]] <- sim %>%
filter(seg_sites == s) %>%
sample_n(size = n_loci, replace = TRUE)
}
null_distributions <- ldply(null_distributions, data.frame) %>%
mutate(pop = grp)
taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites, tajima_d) %>%
filter(!seg_sites == 0,
pop == grp) %>%
bind_rows(null_distributions) %>%
mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
select(-N_IND, -N_REPS, -pi, -thetaH, -H)
taj_all[[grp]] <- taj
Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).
Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x and y
## D = 0.20954, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
The maximum number of segregating sites per locus in the data set is 7, there are 27) F_FRH run individuals in the data set.
Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 27 individuals (54 observations).
# 2*number of individuals in data set
N=54
# number of loci to simulate
REP=1000
# simulate neutral loci
for i in {1..7}
do
scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind54_Sites${i}_neut_Rep1000.stats
done
The sample_stat script distributed along side
MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.
Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.
# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind54")
# create an empty list that will serve as a container to receive the incoming files
sim <-list()
# create a loop to read in your data
for (i in 1:length(filenames)){
sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
col_names = c("t1", "pi",
"t2", "seg_sites",
"t3", "tajima_d",
"t4", "thetaH",
"t5", "H")) %>%
select(pi, seg_sites, tajima_d, thetaH, H)
}
# add names
names(sim) <- filenames
# combine into dataframe
sim <- ldply(sim, data.frame) %>%
rename(simulation = `.id`) %>%
separate(simulation, into = c("temp", "t1"), sep = -6) %>%
separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
select(-t1, -t2)
reps <- 100
# group to be compared
grp <- "F_FRH"
# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites) %>%
filter(!seg_sites == 0) %>%
filter(pop == grp) %>%
count(seg_sites)
# create null distributions
null_distributions <- list()
for (s in emp$seg_sites) {
# determine number of loci with s segregating sites
t <- emp %>%
filter(seg_sites == s)
n_loci <- t$n
# sample n_loci simulated loci (with replacement)
null_distributions[[s]] <- sim %>%
filter(seg_sites == s) %>%
sample_n(size = n_loci, replace = TRUE)
}
null_distributions <- ldply(null_distributions, data.frame) %>%
mutate(pop = grp)
taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites, tajima_d) %>%
filter(!seg_sites == 0,
pop == grp) %>%
bind_rows(null_distributions) %>%
mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
select(-N_IND, -N_REPS, -pi, -thetaH, -H)
taj_all[[grp]] <- taj
Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).
Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x and y
## D = 0.10915, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
The maximum number of segregating sites per locus in the data set is 7, there are 30) F_NIM run individuals in the data set.
Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 30 individuals (60 observations).
# 2*number of individuals in data set
N=60
# number of loci to simulate
REP=1000
# simulate neutral loci
for i in {1..7}
do
scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind60_Sites${i}_neut_Rep1000.stats
done
The sample_stat script distributed along side
MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.
Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.
# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind60")
# create an empty list that will serve as a container to receive the incoming files
sim <-list()
# create a loop to read in your data
for (i in 1:length(filenames)){
sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
col_names = c("t1", "pi",
"t2", "seg_sites",
"t3", "tajima_d",
"t4", "thetaH",
"t5", "H")) %>%
select(pi, seg_sites, tajima_d, thetaH, H)
}
# add names
names(sim) <- filenames
# combine into dataframe
sim <- ldply(sim, data.frame) %>%
rename(simulation = `.id`) %>%
separate(simulation, into = c("temp", "t1"), sep = -6) %>%
separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
select(-t1, -t2)
reps <- 100
# group to be compared
grp <- "F_NIM"
# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites) %>%
filter(!seg_sites == 0) %>%
filter(pop == grp) %>%
count(seg_sites)
# create null distributions
null_distributions <- list()
for (s in emp$seg_sites) {
# determine number of loci with s segregating sites
t <- emp %>%
filter(seg_sites == s)
n_loci <- t$n
# sample n_loci simulated loci (with replacement)
null_distributions[[s]] <- sim %>%
filter(seg_sites == s) %>%
sample_n(size = n_loci, replace = TRUE)
}
null_distributions <- ldply(null_distributions, data.frame) %>%
mutate(pop = grp)
taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites, tajima_d) %>%
filter(!seg_sites == 0,
pop == grp) %>%
bind_rows(null_distributions) %>%
mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
select(-N_IND, -N_REPS, -pi, -thetaH, -H)
taj_all[[grp]] <- taj
Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).
Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x and y
## D = 0.16563, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
The maximum number of segregating sites per locus in the data set is 7, there are 28) F_MKH run individuals in the data set.
Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 28 individuals (56 observations).
# 2*number of individuals in data set
N=56
# number of loci to simulate
REP=1000
# simulate neutral loci
for i in {1..7}
do
scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind56_Sites${i}_neut_Rep1000.stats
done
The sample_stat script distributed along side
MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.
Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.
# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind56")
# create an empty list that will serve as a container to receive the incoming files
sim <-list()
# create a loop to read in your data
for (i in 1:length(filenames)){
sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
col_names = c("t1", "pi",
"t2", "seg_sites",
"t3", "tajima_d",
"t4", "thetaH",
"t5", "H")) %>%
select(pi, seg_sites, tajima_d, thetaH, H)
}
# add names
names(sim) <- filenames
# combine into dataframe
sim <- ldply(sim, data.frame) %>%
rename(simulation = `.id`) %>%
separate(simulation, into = c("temp", "t1"), sep = -6) %>%
separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
select(-t1, -t2)
reps <- 100
# group to be compared
grp <- "F_MKH"
# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites) %>%
filter(!seg_sites == 0) %>%
filter(pop == grp) %>%
count(seg_sites)
# create null distributions
null_distributions <- list()
for (s in emp$seg_sites) {
# determine number of loci with s segregating sites
t <- emp %>%
filter(seg_sites == s)
n_loci <- t$n
# sample n_loci simulated loci (with replacement)
null_distributions[[s]] <- sim %>%
filter(seg_sites == s) %>%
sample_n(size = n_loci, replace = TRUE)
}
null_distributions <- ldply(null_distributions, data.frame) %>%
mutate(pop = grp)
taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites, tajima_d) %>%
filter(!seg_sites == 0,
pop == grp) %>%
bind_rows(null_distributions) %>%
mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
select(-N_IND, -N_REPS, -pi, -thetaH, -H)
taj_all[[grp]] <- taj
Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).
Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x and y
## D = 0.16758, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
The maximum number of segregating sites per locus in the data set is 7, there are 23) F_STN run individuals in the data set.
Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 23 individuals (46 observations).
# 2*number of individuals in data set
N=46
# number of loci to simulate
REP=1000
# simulate neutral loci
for i in {1..7}
do
scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind46_Sites${i}_neut_Rep1000.stats
done
The sample_stat script distributed along side
MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.
Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.
# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind46")
# create an empty list that will serve as a container to receive the incoming files
sim <-list()
# create a loop to read in your data
for (i in 1:length(filenames)){
sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
col_names = c("t1", "pi",
"t2", "seg_sites",
"t3", "tajima_d",
"t4", "thetaH",
"t5", "H")) %>%
select(pi, seg_sites, tajima_d, thetaH, H)
}
# add names
names(sim) <- filenames
# combine into dataframe
sim <- ldply(sim, data.frame) %>%
rename(simulation = `.id`) %>%
separate(simulation, into = c("temp", "t1"), sep = -6) %>%
separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
select(-t1, -t2)
reps <- 100
# group to be compared
grp <- "F_STN"
# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites) %>%
filter(!seg_sites == 0) %>%
filter(pop == grp) %>%
count(seg_sites)
# create null distributions
null_distributions <- list()
for (s in emp$seg_sites) {
# determine number of loci with s segregating sites
t <- emp %>%
filter(seg_sites == s)
n_loci <- t$n
# sample n_loci simulated loci (with replacement)
null_distributions[[s]] <- sim %>%
filter(seg_sites == s) %>%
sample_n(size = n_loci, replace = TRUE)
}
null_distributions <- ldply(null_distributions, data.frame) %>%
mutate(pop = grp)
taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites, tajima_d) %>%
filter(!seg_sites == 0,
pop == grp) %>%
bind_rows(null_distributions) %>%
mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
select(-N_IND, -N_REPS, -pi, -thetaH, -H)
taj_all[[grp]] <- taj
Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).
Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x and y
## D = 0.19666, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
The maximum number of segregating sites per locus in the data set is 7, there are 30) F_TOU run individuals in the data set.
Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 30 individuals (60 observations).
# 2*number of individuals in data set
N=60
# number of loci to simulate
REP=1000
# simulate neutral loci
for i in {1..7}
do
scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind60_Sites${i}_neut_Rep1000.stats
done
The sample_stat script distributed along side
MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.
Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.
# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind60")
# create an empty list that will serve as a container to receive the incoming files
sim <-list()
# create a loop to read in your data
for (i in 1:length(filenames)){
sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
col_names = c("t1", "pi",
"t2", "seg_sites",
"t3", "tajima_d",
"t4", "thetaH",
"t5", "H")) %>%
select(pi, seg_sites, tajima_d, thetaH, H)
}
# add names
names(sim) <- filenames
# combine into dataframe
sim <- ldply(sim, data.frame) %>%
rename(simulation = `.id`) %>%
separate(simulation, into = c("temp", "t1"), sep = -6) %>%
separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
select(-t1, -t2)
reps <- 100
# group to be compared
grp <- "F_TOU"
# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites) %>%
filter(!seg_sites == 0) %>%
filter(pop == grp) %>%
count(seg_sites)
# create null distributions
null_distributions <- list()
for (s in emp$seg_sites) {
# determine number of loci with s segregating sites
t <- emp %>%
filter(seg_sites == s)
n_loci <- t$n
# sample n_loci simulated loci (with replacement)
null_distributions[[s]] <- sim %>%
filter(seg_sites == s) %>%
sample_n(size = n_loci, replace = TRUE)
}
null_distributions <- ldply(null_distributions, data.frame) %>%
mutate(pop = grp)
taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites, tajima_d) %>%
filter(!seg_sites == 0,
pop == grp) %>%
bind_rows(null_distributions) %>%
mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
select(-N_IND, -N_REPS, -pi, -thetaH, -H)
taj_all[[grp]] <- taj
Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).
Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x and y
## D = 0.1764, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
The maximum number of segregating sites per locus in the data set is 7, there are 15) F_MRH run individuals in the data set.
Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 15 individuals (30 observations).
# 2*number of individuals in data set
N=30
# number of loci to simulate
REP=1000
# simulate neutral loci
for i in {1..7}
do
scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind30_Sites${i}_neut_Rep1000.stats
done
The sample_stat script distributed along side
MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.
Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.
# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind30")
# create an empty list that will serve as a container to receive the incoming files
sim <-list()
# create a loop to read in your data
for (i in 1:length(filenames)){
sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
col_names = c("t1", "pi",
"t2", "seg_sites",
"t3", "tajima_d",
"t4", "thetaH",
"t5", "H")) %>%
select(pi, seg_sites, tajima_d, thetaH, H)
}
# add names
names(sim) <- filenames
# combine into dataframe
sim <- ldply(sim, data.frame) %>%
rename(simulation = `.id`) %>%
separate(simulation, into = c("temp", "t1"), sep = -6) %>%
separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
select(-t1, -t2)
reps <- 100
# group to be compared
grp <- "F_MRH"
# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites) %>%
filter(!seg_sites == 0) %>%
filter(pop == grp) %>%
count(seg_sites)
# create null distributions
null_distributions <- list()
for (s in emp$seg_sites) {
# determine number of loci with s segregating sites
t <- emp %>%
filter(seg_sites == s)
n_loci <- t$n
# sample n_loci simulated loci (with replacement)
null_distributions[[s]] <- sim %>%
filter(seg_sites == s) %>%
sample_n(size = n_loci, replace = TRUE)
}
null_distributions <- ldply(null_distributions, data.frame) %>%
mutate(pop = grp)
taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites, tajima_d) %>%
filter(!seg_sites == 0,
pop == grp) %>%
bind_rows(null_distributions) %>%
mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
select(-N_IND, -N_REPS, -pi, -thetaH, -H)
taj_all[[grp]] <- taj
Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).
Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x and y
## D = 0.19097, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
The maximum number of segregating sites per locus in the data set is 7, there are 31) F_MER run individuals in the data set.
Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 31 individuals (62 observations).
# 2*number of individuals in data set
N=62
# number of loci to simulate
REP=1000
# simulate neutral loci
for i in {1..7}
do
scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind62_Sites${i}_neut_Rep1000.stats
done
The sample_stat script distributed along side
MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.
Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.
# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind62")
# create an empty list that will serve as a container to receive the incoming files
sim <-list()
# create a loop to read in your data
for (i in 1:length(filenames)){
sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
col_names = c("t1", "pi",
"t2", "seg_sites",
"t3", "tajima_d",
"t4", "thetaH",
"t5", "H")) %>%
select(pi, seg_sites, tajima_d, thetaH, H)
}
# add names
names(sim) <- filenames
# combine into dataframe
sim <- ldply(sim, data.frame) %>%
rename(simulation = `.id`) %>%
separate(simulation, into = c("temp", "t1"), sep = -6) %>%
separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
select(-t1, -t2)
reps <- 100
# group to be compared
grp <- "F_MER"
# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites) %>%
filter(!seg_sites == 0) %>%
filter(pop == grp) %>%
count(seg_sites)
# create null distributions
null_distributions <- list()
for (s in emp$seg_sites) {
# determine number of loci with s segregating sites
t <- emp %>%
filter(seg_sites == s)
n_loci <- t$n
# sample n_loci simulated loci (with replacement)
null_distributions[[s]] <- sim %>%
filter(seg_sites == s) %>%
sample_n(size = n_loci, replace = TRUE)
}
null_distributions <- ldply(null_distributions, data.frame) %>%
mutate(pop = grp)
taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites, tajima_d) %>%
filter(!seg_sites == 0,
pop == grp) %>%
bind_rows(null_distributions) %>%
mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
select(-N_IND, -N_REPS, -pi, -thetaH, -H)
taj_all[[grp]] <- taj
Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).
Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x and y
## D = 0.16428, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
The maximum number of segregating sites per locus in the data set is 7, there are 21) L_USR run individuals in the data set.
Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 21 individuals (42 observations).
# 2*number of individuals in data set
N=42
# number of loci to simulate
REP=1000
# simulate neutral loci
for i in {1..7}
do
scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind42_Sites${i}_neut_Rep1000.stats
done
The sample_stat script distributed along side
MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.
Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.
# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind42")
# create an empty list that will serve as a container to receive the incoming files
sim <-list()
# create a loop to read in your data
for (i in 1:length(filenames)){
sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
col_names = c("t1", "pi",
"t2", "seg_sites",
"t3", "tajima_d",
"t4", "thetaH",
"t5", "H")) %>%
select(pi, seg_sites, tajima_d, thetaH, H)
}
# add names
names(sim) <- filenames
# combine into dataframe
sim <- ldply(sim, data.frame) %>%
rename(simulation = `.id`) %>%
separate(simulation, into = c("temp", "t1"), sep = -6) %>%
separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
select(-t1, -t2)
reps <- 100
# group to be compared
grp <- "L_USR"
# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites) %>%
filter(!seg_sites == 0) %>%
filter(pop == grp) %>%
count(seg_sites)
# create null distributions
null_distributions <- list()
for (s in emp$seg_sites) {
# determine number of loci with s segregating sites
t <- emp %>%
filter(seg_sites == s)
n_loci <- t$n
# sample n_loci simulated loci (with replacement)
null_distributions[[s]] <- sim %>%
filter(seg_sites == s) %>%
sample_n(size = n_loci, replace = TRUE)
}
null_distributions <- ldply(null_distributions, data.frame) %>%
mutate(pop = grp)
taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites, tajima_d) %>%
filter(!seg_sites == 0,
pop == grp) %>%
bind_rows(null_distributions) %>%
mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
select(-N_IND, -N_REPS, -pi, -thetaH, -H)
taj_all[[grp]] <- taj
Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).
Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x and y
## D = 0.12979, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
The maximum number of segregating sites per locus in the data set is 7, there are 26) W_USR run individuals in the data set.
Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 26 individuals (52 observations).
# 2*number of individuals in data set
N=52
# number of loci to simulate
REP=1000
# simulate neutral loci
for i in {1..7}
do
scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind52_Sites${i}_neut_Rep1000.stats
done
The sample_stat script distributed along side
MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.
Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.
# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind42")
# create an empty list that will serve as a container to receive the incoming files
sim <-list()
# create a loop to read in your data
for (i in 1:length(filenames)){
sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
col_names = c("t1", "pi",
"t2", "seg_sites",
"t3", "tajima_d",
"t4", "thetaH",
"t5", "H")) %>%
select(pi, seg_sites, tajima_d, thetaH, H)
}
# add names
names(sim) <- filenames
# combine into dataframe
sim <- ldply(sim, data.frame) %>%
rename(simulation = `.id`) %>%
separate(simulation, into = c("temp", "t1"), sep = -6) %>%
separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
select(-t1, -t2)
reps <- 100
# group to be compared
grp <- "W_USR"
# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites) %>%
filter(!seg_sites == 0) %>%
filter(pop == grp) %>%
count(seg_sites)
# create null distributions
null_distributions <- list()
for (s in emp$seg_sites) {
# determine number of loci with s segregating sites
t <- emp %>%
filter(seg_sites == s)
n_loci <- t$n
# sample n_loci simulated loci (with replacement)
null_distributions[[s]] <- sim %>%
filter(seg_sites == s) %>%
sample_n(size = n_loci, replace = TRUE)
}
null_distributions <- ldply(null_distributions, data.frame) %>%
mutate(pop = grp)
taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites, tajima_d) %>%
filter(!seg_sites == 0,
pop == grp) %>%
bind_rows(null_distributions) %>%
mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
select(-N_IND, -N_REPS, -pi, -thetaH, -H)
taj_all[[grp]] <- taj
Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).
Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x and y
## D = 0.21578, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
The maximum number of segregating sites per locus in the data set is 7, there are 16) S_MIL run individuals in the data set.
Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 16 individuals (32 observations).
# 2*number of individuals in data set
N=32
# number of loci to simulate
REP=1000
# simulate neutral loci
for i in {1..7}
do
scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind32_Sites${i}_neut_Rep1000.stats
done
The sample_stat script distributed along side
MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.
Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.
# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind32")
# create an empty list that will serve as a container to receive the incoming files
sim <-list()
# create a loop to read in your data
for (i in 1:length(filenames)){
sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
col_names = c("t1", "pi",
"t2", "seg_sites",
"t3", "tajima_d",
"t4", "thetaH",
"t5", "H")) %>%
select(pi, seg_sites, tajima_d, thetaH, H)
}
# add names
names(sim) <- filenames
# combine into dataframe
sim <- ldply(sim, data.frame) %>%
rename(simulation = `.id`) %>%
separate(simulation, into = c("temp", "t1"), sep = -6) %>%
separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
select(-t1, -t2)
reps <- 100
# group to be compared
grp <- "S_MIL"
# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites) %>%
filter(!seg_sites == 0) %>%
filter(pop == grp) %>%
count(seg_sites)
# create null distributions
null_distributions <- list()
for (s in emp$seg_sites) {
# determine number of loci with s segregating sites
t <- emp %>%
filter(seg_sites == s)
n_loci <- t$n
# sample n_loci simulated loci (with replacement)
null_distributions[[s]] <- sim %>%
filter(seg_sites == s) %>%
sample_n(size = n_loci, replace = TRUE)
}
null_distributions <- ldply(null_distributions, data.frame) %>%
mutate(pop = grp)
taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites, tajima_d) %>%
filter(!seg_sites == 0,
pop == grp) %>%
bind_rows(null_distributions) %>%
mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
select(-N_IND, -N_REPS, -pi, -thetaH, -H)
taj_all[[grp]] <- taj
Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).
Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x and y
## D = 0.16933, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
The maximum number of segregating sites per locus in the data set is 7, there are 27) S_DER run individuals in the data set.
Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 27 individuals (54 observations).
# 2*number of individuals in data set
N=54
# number of loci to simulate
REP=1000
# simulate neutral loci
for i in {1..7}
do
scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind54_Sites${i}_neut_Rep1000.stats
done
The sample_stat script distributed along side
MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.
Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.
# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind54")
# create an empty list that will serve as a container to receive the incoming files
sim <-list()
# create a loop to read in your data
for (i in 1:length(filenames)){
sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
col_names = c("t1", "pi",
"t2", "seg_sites",
"t3", "tajima_d",
"t4", "thetaH",
"t5", "H")) %>%
select(pi, seg_sites, tajima_d, thetaH, H)
}
# add names
names(sim) <- filenames
# combine into dataframe
sim <- ldply(sim, data.frame) %>%
rename(simulation = `.id`) %>%
separate(simulation, into = c("temp", "t1"), sep = -6) %>%
separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
select(-t1, -t2)
reps <- 100
# group to be compared
grp <- "S_DER"
# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites) %>%
filter(!seg_sites == 0) %>%
filter(pop == grp) %>%
count(seg_sites)
# create null distributions
null_distributions <- list()
for (s in emp$seg_sites) {
# determine number of loci with s segregating sites
t <- emp %>%
filter(seg_sites == s)
n_loci <- t$n
# sample n_loci simulated loci (with replacement)
null_distributions[[s]] <- sim %>%
filter(seg_sites == s) %>%
sample_n(size = n_loci, replace = TRUE)
}
null_distributions <- ldply(null_distributions, data.frame) %>%
mutate(pop = grp)
taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites, tajima_d) %>%
filter(!seg_sites == 0,
pop == grp) %>%
bind_rows(null_distributions) %>%
mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
select(-N_IND, -N_REPS, -pi, -thetaH, -H)
taj_all[[grp]] <- taj
Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).
Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x and y
## D = 0.14984, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
The maximum number of segregating sites per locus in the data set is 7, there are 19) S_BUT run individuals in the data set.
Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 19 individuals (38 observations).
# 2*number of individuals in data set
N=38
# number of loci to simulate
REP=1000
# simulate neutral loci
for i in {1..7}
do
scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind38_Sites${i}_neut_Rep1000.stats
done
The sample_stat script distributed along side
MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.
Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.
# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind38")
# create an empty list that will serve as a container to receive the incoming files
sim <-list()
# create a loop to read in your data
for (i in 1:length(filenames)){
sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
col_names = c("t1", "pi",
"t2", "seg_sites",
"t3", "tajima_d",
"t4", "thetaH",
"t5", "H")) %>%
select(pi, seg_sites, tajima_d, thetaH, H)
}
# add names
names(sim) <- filenames
# combine into dataframe
sim <- ldply(sim, data.frame) %>%
rename(simulation = `.id`) %>%
separate(simulation, into = c("temp", "t1"), sep = -6) %>%
separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
select(-t1, -t2)
reps <- 100
# group to be compared
grp <- "S_BUT"
# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites) %>%
filter(!seg_sites == 0) %>%
filter(pop == grp) %>%
count(seg_sites)
# create null distributions
null_distributions <- list()
for (s in emp$seg_sites) {
# determine number of loci with s segregating sites
t <- emp %>%
filter(seg_sites == s)
n_loci <- t$n
# sample n_loci simulated loci (with replacement)
null_distributions[[s]] <- sim %>%
filter(seg_sites == s) %>%
sample_n(size = n_loci, replace = TRUE)
}
null_distributions <- ldply(null_distributions, data.frame) %>%
mutate(pop = grp)
taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites, tajima_d) %>%
filter(!seg_sites == 0,
pop == grp) %>%
bind_rows(null_distributions) %>%
mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
select(-N_IND, -N_REPS, -pi, -thetaH, -H)
taj_all[[grp]] <- taj
Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).
Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x and y
## D = 0.15249, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
The maximum number of segregating sites per locus in the data set is 7, there are 7) S_FRH run individuals in the data set.
Simulate 1000 independent replicates (loci) for 1 - 7 segregating sites under neutral model for 7 individuals (14 observations).
# 2*number of individuals in data set
N=14
# number of loci to simulate
REP=1000
# simulate neutral loci
for i in {1..7}
do
scr/MS/ms $N $REP -s $i | scr/MS/sample_stats > results/Ind14_Sites${i}_neut_Rep1000.stats
done
The sample_stat script distributed along side
MS calculates summary statistics including Tajima’s \(D\) for simulated set of loci.
Generate 100 simulated data sets based on empirical data set, by determining the number of loci with 1 … x segregating sites and randomly drawing that number of loci from simulated data set (1,000 simulated loci per bin; draw with replacement) to generate null distribution consisting of the same number of loci with the same distribution of segregating sites per locus as the empirical data set.
# create a vector of stats files based on simulations
filenames <- list.files(path = "results/", pattern = "Ind14")
# create an empty list that will serve as a container to receive the incoming files
sim <-list()
# create a loop to read in your data
for (i in 1:length(filenames)){
sim[[i]] <- read_delim(file.path("results", filenames[i]), delim = "\t",
col_names = c("t1", "pi",
"t2", "seg_sites",
"t3", "tajima_d",
"t4", "thetaH",
"t5", "H")) %>%
select(pi, seg_sites, tajima_d, thetaH, H)
}
# add names
names(sim) <- filenames
# combine into dataframe
sim <- ldply(sim, data.frame) %>%
rename(simulation = `.id`) %>%
separate(simulation, into = c("temp", "t1"), sep = -6) %>%
separate(temp, into = c("N_IND", "t2", "scenario", "N_REPS"), sep = "_") %>%
mutate(N_REPS = as.numeric(str_extract(N_REPS, "[[:digit:]]+")),
N_IND = as.numeric(str_extract(N_IND, "[[:digit:]]+"))) %>%
select(-t1, -t2)
reps <- 100
# group to be compared
grp <- "S_FRH"
# get counts for number of segregating sites
emp <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites) %>%
filter(pop == grp) %>%
filter(!seg_sites == 0) %>%
count(seg_sites)
# create null distributions
null_distributions <- list()
for (s in emp$seg_sites) {
# determine number of loci with s segregating sites
t <- emp %>%
filter(seg_sites == s)
n_loci <- t$n
# sample n_loci simulated loci (with replacement)
null_distributions[[s]] <- sim %>%
filter(seg_sites == s) %>%
sample_n(size = n_loci, replace = TRUE)
}
null_distributions <- ldply(null_distributions, data.frame) %>%
mutate(pop = grp)
taj <- read_delim("results/trib_run.hapdiv", delim = "\t") %>%
select(locus, pop, seg_sites, tajima_d) %>%
filter(!seg_sites == 0,
pop == grp) %>%
bind_rows(null_distributions) %>%
mutate(scenario = ifelse(is.na(scenario), "empirical", scenario)) %>%
select(-N_IND, -N_REPS, -pi, -thetaH, -H)
taj_all[[grp]] <- taj
Use K-S test statistic to identify simulated null distribution most similar to empirical data set (values closer to 0 indicate higher probability to have been draw from the same distribution).
Cumulative distribution curves of empirical data set (black) to simulated data set (grey). The black dashed lines indicates expected value for Tajima’s D under equilibrium assuming a beta distribution (the median should intersect at 0).
##
## Two-sample Kolmogorov-Smirnov test
##
## data: x and y
## D = 0.30666, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
Let’s compare the null and empirical distributions across all data sets.
Comparison of neutral and empirical distributions of Tajima’s D for each run in each tributary.
Table S10: Mean, median, 25% and 75% quantiles Tajima’s D forempirical and simulated data sets.
| pop | scenario | mean | median | Q1 | Q3 |
|---|---|---|---|---|---|
| F_COL | empirical | -0.0714 | -0.3536 | -0.8912 | 0.6642 |
| F_COL | neut | -0.0308 | -0.2007 | -0.8912 | 0.7825 |
| F_MIL | empirical | -0.0932 | -0.2712 | -1.0037 | 0.6690 |
| F_MIL | neut | -0.0501 | -0.3066 | -0.8360 | 0.7445 |
| F_DER | empirical | -0.1072 | -0.3634 | -1.1470 | 0.7267 |
| F_DER | neut | -0.0530 | -0.3228 | -1.1470 | 0.7685 |
| F_BUT | empirical | -0.0955 | -0.3066 | -0.8452 | 0.6820 |
| F_BUT | neut | -0.0299 | -0.2824 | -0.8912 | 0.7825 |
| F_FRH | empirical | -0.0760 | -0.2896 | -0.8810 | 0.6669 |
| F_FRH | neut | -0.0159 | -0.2896 | -0.8810 | 0.9185 |
| F_NIM | empirical | -0.0596 | -0.3337 | -0.8912 | 0.6642 |
| F_NIM | neut | -0.0179 | -0.1879 | -0.8912 | 0.7825 |
| F_MKH | empirical | -0.0589 | -0.2896 | -0.8848 | 0.6669 |
| F_MKH | neut | -0.0381 | -0.3125 | -0.9130 | 0.7505 |
| F_STN | empirical | -0.0809 | -0.3071 | -0.8602 | 0.7231 |
| F_STN | neut | -0.0592 | -0.2595 | -0.8602 | 0.7261 |
| F_TOU | empirical | -0.0683 | -0.3125 | -0.8848 | 0.6669 |
| F_TOU | neut | -0.0197 | -0.1879 | -0.8912 | 0.7825 |
| F_MRH | empirical | -0.0883 | -0.3105 | -1.1470 | 0.7267 |
| F_MRH | neut | -0.0427 | -0.2761 | -1.1470 | 0.8711 |
| F_MER | empirical | -0.0636 | -0.3337 | -0.8912 | 0.6642 |
| F_MER | neut | -0.0231 | -0.2108 | -0.8939 | 0.7411 |
| L_USR | empirical | -0.0465 | -0.2897 | -0.8452 | 0.7445 |
| L_USR | neut | -0.0576 | -0.3385 | -0.8452 | 0.6820 |
| W_USR | empirical | 0.0943 | -0.0790 | -0.8767 | 1.0812 |
| W_USR | neut | -0.0639 | -0.3385 | -0.9230 | 0.6820 |
| S_MIL | empirical | -0.0507 | -0.1747 | -0.9721 | 0.7592 |
| S_MIL | neut | -0.0042 | -0.1384 | -1.0305 | 0.8533 |
| S_DER | empirical | -0.0621 | -0.2649 | -0.8719 | 0.6723 |
| S_DER | neut | -0.0589 | -0.2896 | -0.8810 | 0.7969 |
| S_BUT | empirical | 0.0491 | -0.0201 | -0.8134 | 0.8827 |
| S_BUT | neut | -0.0507 | -0.2712 | -1.1020 | 0.8113 |
| S_FRH | empirical | -0.0874 | -0.3414 | -1.1552 | 0.8423 |
| S_FRH | neut | -0.0478 | -0.3414 | -1.1552 | 0.8423 |
Difference in mean Tajima’s D between empirical and simulateddata sets.
| pop | empirical | neut | difference |
|---|---|---|---|
| W_USR | 0.0943 | -0.0639 | 0.1582 |
| S_BUT | 0.0491 | -0.0507 | 0.0998 |
| L_USR | -0.0465 | -0.0576 | 0.0111 |
| S_DER | -0.0621 | -0.0589 | -0.0031 |
| F_MKH | -0.0589 | -0.0381 | -0.0209 |
| F_STN | -0.0809 | -0.0592 | -0.0217 |
| S_FRH | -0.0874 | -0.0478 | -0.0396 |
| F_MER | -0.0636 | -0.0231 | -0.0405 |
| F_COL | -0.0714 | -0.0308 | -0.0406 |
| F_NIM | -0.0596 | -0.0179 | -0.0417 |
| F_MIL | -0.0932 | -0.0501 | -0.0431 |
| F_MRH | -0.0883 | -0.0427 | -0.0456 |
| S_MIL | -0.0507 | -0.0042 | -0.0464 |
| F_TOU | -0.0683 | -0.0197 | -0.0486 |
| F_DER | -0.1072 | -0.0530 | -0.0541 |
| F_FRH | -0.0760 | -0.0159 | -0.0601 |
| F_BUT | -0.0955 | -0.0299 | -0.0655 |
Difference in mean Tajima’s D between empirical and simulateddata sets.
| pop | empirical | neut | difference |
|---|---|---|---|
| W_USR | -0.0790 | -0.3385 | 0.2594 |
| S_BUT | -0.0201 | -0.2712 | 0.2512 |
| L_USR | -0.2897 | -0.3385 | 0.0488 |
| F_MIL | -0.2712 | -0.3066 | 0.0354 |
| S_DER | -0.2649 | -0.2896 | 0.0247 |
| F_MKH | -0.2896 | -0.3125 | 0.0229 |
| F_FRH | -0.2896 | -0.2896 | 0.0000 |
| S_FRH | -0.3414 | -0.3414 | 0.0000 |
| F_BUT | -0.3066 | -0.2824 | -0.0242 |
| F_MRH | -0.3105 | -0.2761 | -0.0344 |
| S_MIL | -0.1747 | -0.1384 | -0.0364 |
| F_DER | -0.3634 | -0.3228 | -0.0406 |
| F_STN | -0.3071 | -0.2595 | -0.0476 |
| F_MER | -0.3337 | -0.2108 | -0.1229 |
| F_TOU | -0.3125 | -0.1879 | -0.1246 |
| F_NIM | -0.3337 | -0.1879 | -0.1459 |
| F_COL | -0.3536 | -0.2007 | -0.1529 |
Typically the mean is lower than the median but we are more interested in genome-wide patterns so the median is more informative.
The median lower in the empirical data sets compared to the simulated data set under mutation drift equilibrium in the following populations.
Very small to no difference in the medians for the following populations:
The mean and median are higher in the empirical data sets compared to the simulated data set under mutation drift equilibrium in the following populations
Fixed loci are loci that are not polymorphic for a given set of individuals, i.e. all individuals are homozygous for the same allele. The global allele frequency/counts is the allele frequency or count for a given locus across all individuals.
Differences in sample size may create bias for this comparison, as sample sizes that are (too) low can result in not all alleles being present in a population being sampled in a sufficiently representative manner, which can lead loci to appear fixed even though they do have rare alleles that were simply not sampled. Sample sizes are quite variable across run types, but for individuals grouped by run type withing tributaries, the sample sizes do become much more comparable.
Comparison of fixed loci within tributaries within runs
Identify number of fixed loci within individuals grouped by tributary and run type.
Number of fixed alleles for individuals of each tributarygrouped by run type (using rarefied allele counts).
| GRP | n |
|---|---|
| F_COL | 3835 |
| F_MIL | 3574 |
| F_DER | 4898 |
| F_BUT | 4702 |
| F_FRH | 4385 |
| F_NIM | 4369 |
| F_MKH | 3788 |
| F_STN | 3609 |
| F_TOU | 3655 |
| F_MRH | 3504 |
| F_MER | 4191 |
| L_USR | 6416 |
| W_USR | 3853 |
| S_MIL | 4910 |
| S_DER | 5179 |
| S_BUT | 6210 |
| S_FRH | 4423 |
Identify loci that are fixed in more than on set of individuals grouped by tributary and run type: The set size (horizontal green bars) indicates the total number of loci fixed in a given location, the intersection size (vertical orange bars) indicates the number of loci fixed only in a single location (single blue dot) or in two, three, or four locations (indicated by blue dots connected by line).
Calculate allelic richness across all individuals in a data set.
# calculate global allelic richness
setPop(gen) <- ~OVERALL
# calculate allele counts using rarefaction
dat <- genind2hierfstat(gen)
df <- allelic.richness(dat,
diploid = TRUE)
ar <- as.data.frame(df$Ar) %>%
rownames_to_column("LOCUS") %>%
select(-3)
colnames(ar) <- c("LOCUS", "OVERALL")
write_delim(ar, "results/rarefied.allelecount", delim = "\t")
Determine the global rarefied allele counts for loci that are fixed in at least one group.
Distribution of global rarefied allele counts for loci fixed across individuals grouped by tributary within runs.
Table S11: Comparison of mean, median, 25th and 75th quantileof global diversity at fixed loci for individuals grouped byrun/tributary.
| GRP | mean | median | Q1 | Q3 |
|---|---|---|---|---|
| F_COL | 1.20 | 1.13 | 1.07 | 1.30 |
| F_MIL | 1.19 | 1.13 | 1.07 | 1.29 |
| F_DER | 1.30 | 1.19 | 1.07 | 1.46 |
| F_BUT | 1.28 | 1.19 | 1.07 | 1.44 |
| F_FRH | 1.26 | 1.14 | 1.07 | 1.40 |
| F_NIM | 1.25 | 1.14 | 1.07 | 1.39 |
| F_MKH | 1.20 | 1.13 | 1.07 | 1.29 |
| F_STN | 1.18 | 1.13 | 1.07 | 1.25 |
| F_TOU | 1.19 | 1.13 | 1.07 | 1.29 |
| F_MRH | 1.19 | 1.13 | 1.07 | 1.29 |
| F_MER | 1.24 | 1.13 | 1.07 | 1.36 |
| L_USR | 1.46 | 1.35 | 1.12 | 1.79 |
| W_USR | 1.31 | 1.19 | 1.07 | 1.48 |
| S_MIL | 1.36 | 1.24 | 1.07 | 1.59 |
| S_DER | 1.38 | 1.25 | 1.07 | 1.63 |
| S_BUT | 1.41 | 1.30 | 1.07 | 1.70 |
| S_FRH | 1.28 | 1.19 | 1.07 | 1.43 |
Compare the distribution of fixed loci across the genome.
Proportion of loci on chromosome of fixed loci per chromosomefor individuals grouped by run type.
| CHR | total | F_COL | F_MIL | F_DER | F_BUT | F_FRH | F_NIM | F_MKH | F_STN | F_TOU | F_MRH | F_MER | L_USR | W_USR | S_MIL | S_DER | S_BUT | S_FRH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 8167 | 2.8 | 2.4 | 3.2 | 3.3 | 2.9 | 3.1 | 2.9 | 2.5 | 2.5 | 2.6 | 3.1 | 4.7 | 2.9 | 3.6 | 3.7 | 4.2 | 3.0 |
| 2 | 4869 | 2.4 | 2.1 | 3.1 | 2.9 | 2.9 | 2.6 | 2.1 | 2.3 | 2.2 | 2.4 | 2.5 | 4.1 | 2.5 | 3.1 | 3.2 | 4.0 | 3.1 |
| 3 | 5253 | 2.7 | 2.5 | 3.6 | 3.4 | 3.2 | 3.1 | 2.7 | 2.6 | 2.6 | 2.3 | 3.0 | 4.6 | 2.7 | 3.7 | 3.6 | 4.5 | 3.0 |
| 4 | 5657 | 2.4 | 2.2 | 3.3 | 2.9 | 2.7 | 2.7 | 2.4 | 2.3 | 2.2 | 2.2 | 2.5 | 3.5 | 2.4 | 3.1 | 3.0 | 3.7 | 2.6 |
| 5 | 6268 | 2.8 | 2.6 | 3.6 | 3.5 | 3.1 | 3.3 | 2.8 | 2.6 | 2.6 | 2.4 | 3.0 | 4.4 | 2.6 | 3.4 | 3.7 | 4.1 | 3.1 |
| 6 | 6408 | 2.5 | 2.4 | 3.4 | 3.1 | 2.9 | 2.9 | 2.7 | 2.2 | 2.4 | 2.4 | 2.8 | 4.3 | 2.5 | 3.2 | 3.3 | 3.8 | 3.0 |
| 7 | 5707 | 2.5 | 2.2 | 3.2 | 3.0 | 2.8 | 2.7 | 2.3 | 2.2 | 2.5 | 2.1 | 2.6 | 4.2 | 2.4 | 2.9 | 3.4 | 3.7 | 2.6 |
| 8 | 6666 | 2.3 | 2.1 | 2.9 | 3.0 | 2.7 | 3.0 | 2.5 | 2.5 | 2.3 | 2.3 | 2.7 | 4.3 | 2.4 | 3.0 | 3.4 | 4.3 | 3.1 |
| 9 | 6000 | 2.8 | 2.7 | 3.3 | 3.3 | 3.2 | 3.2 | 2.6 | 2.7 | 2.6 | 2.2 | 2.8 | 4.2 | 2.6 | 3.3 | 3.4 | 4.0 | 3.1 |
| 10 | 5351 | 2.5 | 2.2 | 3.1 | 2.9 | 2.7 | 2.6 | 2.1 | 2.1 | 2.2 | 2.1 | 2.8 | 3.9 | 2.4 | 3.3 | 3.5 | 3.8 | 2.7 |
| 11 | 3095 | 2.8 | 2.3 | 3.1 | 3.3 | 3.2 | 2.8 | 2.7 | 2.4 | 2.5 | 2.5 | 2.6 | 4.9 | 2.6 | 3.2 | 3.4 | 4.4 | 3.1 |
| 12 | 5620 | 2.0 | 1.8 | 2.5 | 2.4 | 2.2 | 2.2 | 1.8 | 1.7 | 1.9 | 1.9 | 2.1 | 3.3 | 2.1 | 2.6 | 2.6 | 3.2 | 2.3 |
| 13 | 5979 | 2.3 | 2.2 | 3.1 | 3.4 | 2.7 | 2.9 | 2.6 | 2.3 | 2.4 | 2.2 | 2.8 | 4.1 | 2.6 | 3.1 | 3.2 | 4.2 | 2.9 |
| 14 | 4284 | 2.5 | 2.4 | 3.5 | 3.2 | 2.7 | 2.6 | 2.3 | 2.5 | 2.7 | 2.3 | 2.9 | 4.7 | 2.5 | 3.5 | 3.5 | 4.4 | 2.8 |
| 15 | 3245 | 1.9 | 2.3 | 2.9 | 3.0 | 2.9 | 2.7 | 1.8 | 2.0 | 2.0 | 2.2 | 2.3 | 3.9 | 2.4 | 2.9 | 3.5 | 3.9 | 3.0 |
| 16 | 5093 | 2.4 | 2.5 | 3.3 | 3.0 | 2.9 | 2.7 | 2.5 | 2.4 | 2.4 | 2.4 | 2.7 | 4.3 | 2.8 | 3.4 | 3.4 | 3.9 | 2.7 |
| 17 | 2070 | 2.2 | 2.2 | 2.7 | 3.0 | 2.6 | 2.3 | 2.1 | 2.1 | 2.1 | 2.3 | 2.7 | 4.2 | 2.4 | 2.8 | 3.0 | 3.9 | 2.6 |
| 18 | 2958 | 2.5 | 2.2 | 3.0 | 2.8 | 2.7 | 2.9 | 2.3 | 2.6 | 2.3 | 2.3 | 2.8 | 4.1 | 2.5 | 3.2 | 3.5 | 4.2 | 2.6 |
| 19 | 5071 | 2.5 | 2.5 | 3.3 | 3.2 | 2.9 | 2.9 | 2.6 | 2.5 | 2.4 | 2.4 | 2.8 | 4.1 | 2.7 | 3.7 | 3.8 | 4.8 | 2.9 |
| 20 | 3797 | 2.5 | 2.4 | 3.5 | 2.9 | 2.9 | 2.8 | 2.4 | 2.3 | 2.6 | 2.5 | 2.9 | 4.6 | 2.3 | 3.3 | 3.6 | 4.2 | 3.1 |
| 21 | 2937 | 2.8 | 2.9 | 3.4 | 2.9 | 3.5 | 2.8 | 2.7 | 2.8 | 2.7 | 2.6 | 3.0 | 4.3 | 2.9 | 3.5 | 4.0 | 4.1 | 2.9 |
| 22 | 3447 | 2.5 | 2.3 | 3.4 | 3.0 | 2.9 | 3.0 | 2.5 | 2.2 | 2.5 | 2.3 | 2.5 | 4.9 | 2.5 | 3.3 | 3.5 | 4.1 | 3.0 |
| 23 | 1756 | 2.7 | 2.1 | 3.2 | 2.9 | 2.7 | 2.5 | 2.5 | 2.1 | 2.2 | 2.0 | 2.5 | 3.8 | 2.3 | 3.2 | 3.2 | 3.5 | 2.9 |
| 24 | 2314 | 2.1 | 2.0 | 3.2 | 3.0 | 2.7 | 2.5 | 2.5 | 2.2 | 2.0 | 2.0 | 2.5 | 3.9 | 2.4 | 3.1 | 3.2 | 4.3 | 3.0 |
| 25 | 2980 | 2.4 | 2.4 | 3.1 | 3.0 | 2.6 | 3.1 | 2.2 | 2.0 | 2.3 | 2.2 | 2.9 | 4.3 | 2.3 | 3.2 | 3.8 | 4.3 | 2.6 |
| 26 | 4047 | 2.7 | 2.5 | 3.5 | 3.1 | 3.1 | 3.1 | 2.9 | 2.6 | 2.3 | 2.4 | 2.7 | 4.8 | 2.5 | 3.2 | 3.6 | 4.3 | 3.0 |
| 27 | 1997 | 2.8 | 2.3 | 3.0 | 2.8 | 3.0 | 3.1 | 2.6 | 2.5 | 2.6 | 2.4 | 3.2 | 3.3 | 2.5 | 3.2 | 2.9 | 3.5 | 3.1 |
| 28 | 3291 | 2.2 | 2.1 | 3.1 | 2.9 | 2.4 | 2.8 | 2.3 | 2.5 | 2.4 | 2.1 | 2.4 | 4.0 | 2.6 | 2.9 | 3.1 | 3.4 | 2.8 |
| 29 | 2272 | 3.0 | 2.6 | 3.5 | 3.7 | 3.3 | 3.2 | 2.8 | 2.6 | 3.0 | 2.3 | 2.9 | 3.9 | 2.8 | 3.4 | 3.5 | 4.6 | 3.2 |
| 30 | 3383 | 3.0 | 3.0 | 3.4 | 3.2 | 3.5 | 3.3 | 3.0 | 2.7 | 2.7 | 2.5 | 3.6 | 4.8 | 2.9 | 3.4 | 3.8 | 4.9 | 3.2 |
| 31 | 2772 | 2.3 | 2.2 | 2.8 | 2.8 | 2.5 | 2.8 | 2.4 | 2.2 | 1.9 | 1.8 | 2.7 | 3.5 | 2.3 | 2.6 | 2.8 | 3.7 | 2.7 |
| 32 | 1274 | 1.3 | 1.6 | 2.0 | 2.0 | 2.1 | 1.6 | 1.6 | 1.7 | 1.6 | 1.6 | 1.9 | 3.1 | 1.8 | 2.4 | 2.2 | 3.5 | 2.4 |
| 33 | 3255 | 2.1 | 2.1 | 2.4 | 2.8 | 2.4 | 2.3 | 2.2 | 2.0 | 2.0 | 1.9 | 2.2 | 3.8 | 1.8 | 2.7 | 2.9 | 3.6 | 2.4 |
| 34 | 1094 | 2.0 | 1.6 | 2.2 | 2.2 | 2.4 | 2.9 | 1.9 | 1.8 | 2.3 | 2.2 | 2.4 | 2.8 | 2.3 | 2.7 | 2.8 | 3.0 | 2.8 |
Determine the null distribution of the proportion of loci expected to be fixed per chromosome if fixed loci are randomly distributed among chromosomes. Maintain the total number of loci fixed for a run/tributary group and the total number of loci per chromosome in the data by shuffling chromosome designations across loci. Then calculate the proportion of loci fixed on a chromosome for each run/tributary group and determining if the observed value falls is higher than the 5th and lower than the 95th percentile; determine significance as observed value lying outside observed distribution.
obs <- tidy_ar %>%
mutate(FIXED = ifelse(AR == 1, TRUE, FALSE)) %>%
mutate(GRP = ordered(GRP, levels = trib_runs)) %>%
left_join(contigs) %>%
filter(!is.na(CHROMOSOME)) %>%
select(GRP, LOCUS, FIXED, CHROMOSOME)
reps <- 1000
sim <- list()
for(i in 1:reps){
sim[[i]] <- obs %>%
group_by(GRP) %>%
mutate(SIM = sample(CHROMOSOME, replace = FALSE)) %>%
ungroup() %>%
filter(FIXED == TRUE) %>%
count(GRP, SIM) %>%
left_join(n_chrom, by = c("SIM" = "CHROMOSOME")) %>%
mutate(percent = (n/total)*100)
}
null_dist <- ldply(sim, data.frame) %>%
group_by(GRP, SIM) %>%
summarize(min = min(percent),
max = max(percent)) %>%
ungroup() %>%
rename(CHROMOSOME = SIM) %>%
left_join(pattern) %>%
select(GRP, CHROMOSOME, total, n, percent, min, max) %>%
mutate(sign = ifelse(percent > min & percent < max, "expected", "significant"))
Compare observed and simulated distributions.
Table S12: Chromosomes with significantly higher/lower thanexpected (outside simulated null distribution) number of fixed loci foreach run/tributary group.
| GRP | CHROMOSOME | total | n | percent | min | max |
|---|---|---|---|---|---|---|
| F_MIL | 9 | 6000 | 161 | 2.68 | 1.62 | 2.65 |
| F_DER | 33 | 3255 | 77 | 2.37 | 2.58 | 4.09 |
| F_FRH | 9 | 6000 | 191 | 3.18 | 2.08 | 3.13 |
| L_USR | 19 | 5071 | 208 | 4.10 | 4.30 | 5.72 |
| L_USR | 24 | 2314 | 91 | 3.93 | 3.93 | 6.09 |
| W_USR | 33 | 3255 | 59 | 1.81 | 1.90 | 3.29 |
| S_FRH | 19 | 5071 | 146 | 2.88 | 2.92 | 4.24 |
Table S13: Mean +/- sd, minimum, and maximum percent of lociper chromosome that are fixed across a group of individuals with thesame run type for a given tributary.
| GRP | mean | sd | min | max |
|---|---|---|---|---|
| F_COL | 2.45 | 0.35 | 1.26 | 3.04 |
| F_MIL | 2.30 | 0.30 | 1.57 | 3.02 |
| F_DER | 3.12 | 0.38 | 2.04 | 3.59 |
| F_BUT | 2.99 | 0.34 | 1.96 | 3.65 |
| F_FRH | 2.82 | 0.32 | 2.12 | 3.47 |
| F_NIM | 2.79 | 0.35 | 1.57 | 3.31 |
| F_MKH | 2.42 | 0.33 | 1.65 | 2.99 |
| F_STN | 2.31 | 0.28 | 1.73 | 2.83 |
| F_TOU | 2.35 | 0.29 | 1.57 | 2.99 |
| F_MRH | 2.24 | 0.22 | 1.57 | 2.58 |
| F_MER | 2.70 | 0.32 | 1.88 | 3.58 |
| L_USR | 4.10 | 0.51 | 2.83 | 4.91 |
| W_USR | 2.46 | 0.25 | 1.81 | 2.90 |
| S_MIL | 3.15 | 0.32 | 2.43 | 3.69 |
| S_DER | 3.33 | 0.38 | 2.20 | 3.98 |
| S_BUT | 4.00 | 0.43 | 3.02 | 4.94 |
| S_FRH | 2.86 | 0.24 | 2.35 | 3.21 |
Singletons are alleles that only occur in a single individual. This could be a locus that is only variable in one individual or a locus that has multiple alleles though one (or more) of those alleles are extremely rare occurring in only one individual.
# group all individuals in single group
setPop(gen) <- ~OVERALL
# calculate basic stats
dat <- genind2hierfstat(gen)
stats <- basic.stats(dat)
# extract & format allele frequencies per locus into single df
f <- stats$pop.freq
freq <- list()
for(l in names(f)){
freq[[l]] <- as.data.frame(f[[l]]) %>%
filter(Var2 == 1) %>%
rename(ALLELE = x,
FRQ = Freq) %>%
select(ALLELE, FRQ)
}
freq <- ldply(freq, data.frame) %>%
rename(LOCUS = `.id`)
# identify singletons
Nind <- length(indNames(gen))
min <- round(1/(2*Nind), digits = 4)
singletons <- freq %>%
filter(FRQ <= min) %>%
mutate(ALLELE = as.integer(ALLELE))
Total number of alleles across all loci in the data set is 29873, the total number of singletons is 363 (1.2% of loci) exhibit at least one singleton.
Number of loci with x singletons.
| singletons | loci |
|---|---|
| 1 | 347 |
| 2 | 8 |
Compare distribution of loci with singletons across the genome to random null distribution.
Chromosomes with significantly higher/lower than expected(outside simulated null distribution) number of loci with observedsingletons.
| CHROMOSOME | min | max | n |
|---|
Determine the number of singletons individuals in each group carry.
Comparison of number of singletons per individual (individual circles); for individuals grouped by tributary and run type.
Table S14: Mean, standard deviation, median, 25th and 27thquantile of the number of singletons observer per individual forindividuals grouped by run/tributary.
| RUN_LOC | mean | sd | median | Q1 | Q3 |
|---|---|---|---|---|---|
| F_COL | 1.55 | 1.00 | 1 | 1.00 | 2.00 |
| F_MIL | 1.38 | 0.65 | 1 | 1.00 | 2.00 |
| F_DER | 1.67 | 1.12 | 1 | 1.00 | 2.00 |
| F_BUT | 1.33 | 0.82 | 1 | 1.00 | 1.00 |
| F_FRH | 1.71 | 0.92 | 1 | 1.00 | 2.00 |
| F_NIM | 2.21 | 1.67 | 1 | 1.00 | 3.75 |
| F_MKH | 2.00 | 0.78 | 2 | 1.25 | 2.75 |
| F_STN | 1.00 | 0.00 | 1 | 1.00 | 1.00 |
| F_TOU | 1.36 | 0.63 | 1 | 1.00 | 1.75 |
| F_MRH | 1.71 | 0.76 | 2 | 1.00 | 2.00 |
| F_MER | 1.47 | 0.80 | 1 | 1.00 | 2.00 |
| L_USR | 2.23 | 1.54 | 2 | 1.00 | 2.00 |
| W_USR | 1.78 | 1.39 | 1 | 1.00 | 2.00 |
| S_MIL | 1.82 | 1.40 | 1 | 1.00 | 2.00 |
| S_DER | 3.00 | 4.64 | 2 | 1.00 | 2.75 |
| S_BUT | 4.00 | 3.46 | 3 | 2.50 | 4.50 |
| S_FRH | 1.25 | 0.50 | 1 | 1.00 | 1.25 |
Identify loci that are polymorphic in one set of individuals (i.e. \(N(alleles) > 1\)) but that locus is not variable in any other set of individuals (i.e. fixed across all other sets of loci). The polymorphisms must occur in at least one individual of a group (i.e. singletons) but could be variable in multiple individuals, the allele other groups are fixed for may also occur in the group it is variable at, i.e. these are not necessarily private alleles, rather they are loci exclusively polymorphic in a single group.
Identify loci only variable among individuals from the same tributary by run type (private polymorphisms)
tidy_ar <- read_delim("results/trib_run_rarefied.allelecount", delim = "\t") %>%
gather(key = GRP, AR, 2:18) %>%
mutate(GRP = ordered(GRP, levels = trib_runs)) %>%
rename(N_ALLELES = AR)
pops <- unique(tidy_ar$GRP)
# df for results
results <- setNames(data.frame(matrix(ncol = 2, nrow = 0)),
c("LOCUS", "POLYMORPHIC")) %>%
mutate(LOCUS = as.character(LOCUS),
POLYMORPHIC = as.character(POLYMORPHIC))
# loop over to identify loci polymorphic in one pop but fixed in all others
for(p in pops){
# find loci polymorphic in pop
polymorphic <- tidy_ar %>%
filter(N_ALLELES > 1 & GRP == p)
# find loci fixed in all other pops
fixed <- tidy_ar %>%
filter(!GRP == p) %>%
group_by(LOCUS) %>%
count(N_ALLELES == 1) %>%
filter(`N_ALLELES == 1` == TRUE) %>%
filter(n == length(pops)-1) %>%
mutate(POLYMORPHIC = p) %>%
select(LOCUS, POLYMORPHIC)
results <- bind_rows(results, fixed)
}
pp <- results %>%
count(LOCUS) %>%
filter(n == 1)
results <- results %>%
filter(LOCUS %in% pp$LOCUS)
Determine differences in number of loci with private polymorphisms among run/tributary groups.
Number of loci polymorphic in individuals of the same run typewithin a tributary that are not variable in individuals of any othergroup based on rarefied allele counts.
| POLYMORPHIC | n |
|---|---|
| W_USR | 153 |
| S_DER | 97 |
| F_MRH | 89 |
| S_FRH | 84 |
| F_MIL | 83 |
| F_TOU | 82 |
| L_USR | 70 |
| S_MIL | 70 |
| F_COL | 57 |
| F_STN | 55 |
| F_MKH | 54 |
| F_DER | 49 |
| F_FRH | 47 |
| F_MER | 46 |
| F_BUT | 44 |
| F_NIM | 33 |
| S_BUT | 18 |
Compare the distribution of exclusively polymorphic loci across the genome.
Determine the percent of loci on a chromosome that are only polymorphic for a set of individuals from the same tributary of the same run type.
Percent loci on a chromosome that are only polymorphic for aset of individuals from the same tributary of the same runtype.
| CHROMOSOME | total | F_COL | F_MIL | F_DER | F_BUT | F_FRH | F_NIM | F_MKH | F_STN | F_TOU | F_MRH | F_MER | L_USR | W_USR | S_MIL | S_DER | S_BUT | S_FRH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 8167 | 0.09 | 0.10 | 0.04 | 0.04 | 0.04 | 0.04 | 0.02 | 0.04 | 0.07 | 0.10 | 0.04 | 0.05 | 0.07 | 0.07 | 0.11 | 0.04 | 0.02 |
| 2 | 4869 | 0.02 | 0.04 | NA | 0.02 | 0.04 | 0.02 | 0.04 | 0.02 | 0.08 | 0.06 | NA | NA | 0.12 | NA | 0.02 | NA | 0.02 |
| 3 | 5253 | 0.04 | 0.04 | 0.02 | 0.04 | 0.02 | NA | 0.02 | 0.04 | 0.02 | 0.15 | 0.06 | 0.04 | 0.13 | 0.06 | 0.06 | NA | 0.06 |
| 4 | 5657 | 0.02 | 0.02 | 0.05 | 0.04 | 0.02 | 0.02 | 0.07 | 0.11 | 0.04 | 0.05 | 0.04 | 0.04 | 0.12 | 0.02 | 0.05 | NA | 0.05 |
| 5 | 6268 | 0.03 | 0.10 | 0.02 | 0.03 | NA | 0.03 | NA | 0.02 | 0.08 | 0.11 | 0.10 | 0.05 | 0.16 | NA | 0.11 | 0.02 | 0.10 |
| 6 | 6408 | 0.05 | 0.06 | 0.02 | 0.02 | 0.03 | 0.05 | 0.02 | 0.05 | 0.06 | 0.02 | 0.02 | 0.05 | 0.09 | 0.08 | 0.08 | 0.02 | 0.05 |
| 7 | 5707 | 0.05 | 0.07 | 0.02 | NA | 0.07 | NA | 0.02 | 0.02 | 0.04 | 0.12 | 0.07 | 0.04 | 0.18 | 0.07 | 0.07 | 0.04 | 0.07 |
| 8 | 6666 | 0.03 | 0.02 | 0.02 | 0.03 | 0.05 | 0.05 | 0.03 | 0.02 | 0.05 | 0.05 | NA | 0.08 | 0.02 | 0.05 | 0.05 | 0.03 | NA |
| 9 | 6000 | NA | 0.10 | 0.03 | 0.03 | 0.03 | 0.03 | 0.05 | 0.03 | 0.05 | 0.12 | 0.07 | 0.02 | 0.13 | 0.07 | 0.05 | NA | 0.05 |
| 10 | 5351 | 0.02 | 0.06 | 0.04 | 0.04 | NA | 0.04 | 0.06 | NA | 0.02 | 0.02 | 0.02 | 0.09 | 0.11 | NA | 0.07 | NA | 0.09 |
| 11 | 3095 | 0.03 | 0.06 | 0.10 | NA | 0.03 | 0.03 | 0.10 | NA | 0.06 | 0.03 | NA | NA | NA | 0.03 | 0.06 | 0.03 | 0.03 |
| 12 | 5620 | 0.02 | 0.07 | 0.02 | 0.04 | 0.02 | NA | 0.04 | 0.05 | 0.05 | 0.05 | 0.02 | 0.02 | 0.07 | 0.05 | 0.11 | NA | 0.04 |
| 13 | 5979 | 0.05 | 0.05 | NA | NA | 0.02 | NA | 0.03 | 0.03 | 0.02 | 0.08 | 0.03 | 0.03 | 0.07 | 0.05 | 0.03 | 0.02 | 0.07 |
| 14 | 4284 | 0.12 | 0.09 | 0.02 | NA | 0.05 | 0.02 | 0.02 | 0.02 | 0.05 | 0.09 | 0.02 | 0.05 | 0.07 | 0.05 | 0.02 | NA | 0.05 |
| 15 | 3245 | 0.03 | NA | 0.03 | 0.03 | NA | 0.03 | 0.06 | 0.06 | 0.06 | NA | 0.06 | NA | 0.12 | NA | 0.03 | 0.03 | 0.03 |
| 16 | 5093 | 0.04 | 0.02 | 0.04 | 0.06 | 0.06 | NA | 0.08 | NA | 0.14 | 0.02 | 0.02 | 0.02 | 0.08 | NA | 0.04 | 0.02 | 0.06 |
| 17 | 2070 | NA | NA | NA | NA | 0.05 | NA | 0.10 | 0.05 | 0.10 | 0.10 | NA | NA | 0.14 | NA | 0.10 | 0.05 | NA |
| 18 | 2958 | NA | 0.03 | 0.14 | 0.10 | NA | NA | NA | NA | NA | 0.03 | 0.07 | 0.03 | 0.14 | 0.07 | 0.10 | NA | 0.03 |
| 19 | 5071 | 0.08 | NA | 0.06 | 0.04 | 0.12 | 0.04 | NA | 0.04 | 0.12 | 0.06 | NA | 0.04 | 0.04 | 0.02 | 0.04 | NA | 0.04 |
| 20 | 3797 | NA | 0.03 | 0.05 | NA | NA | NA | 0.05 | 0.03 | NA | 0.03 | NA | 0.03 | 0.16 | 0.05 | 0.08 | NA | 0.08 |
| 21 | 2937 | 0.03 | 0.03 | 0.03 | 0.03 | 0.07 | NA | 0.03 | 0.03 | 0.07 | 0.03 | 0.03 | 0.17 | 0.03 | 0.03 | NA | NA | 0.10 |
| 22 | 3447 | NA | 0.03 | NA | 0.03 | 0.03 | NA | 0.03 | 0.03 | NA | 0.09 | 0.03 | 0.12 | 0.03 | 0.09 | 0.03 | NA | 0.09 |
| 23 | 1756 | 0.06 | 0.06 | 0.06 | NA | NA | NA | NA | NA | NA | 0.06 | NA | 0.17 | 0.17 | NA | NA | NA | 0.11 |
| 24 | 2314 | NA | 0.04 | NA | 0.04 | 0.04 | NA | 0.04 | 0.09 | 0.13 | 0.04 | 0.04 | NA | 0.04 | NA | 0.09 | NA | 0.09 |
| 25 | 2980 | 0.10 | NA | NA | NA | 0.03 | NA | 0.03 | 0.10 | NA | NA | NA | 0.03 | 0.13 | 0.07 | 0.07 | NA | 0.07 |
| 26 | 4047 | 0.02 | 0.05 | 0.07 | 0.02 | 0.05 | NA | 0.05 | NA | 0.10 | 0.05 | 0.02 | 0.05 | 0.27 | 0.05 | 0.07 | NA | 0.02 |
| 27 | 1997 | 0.10 | 0.15 | 0.10 | NA | 0.05 | NA | 0.05 | 0.05 | 0.05 | NA | NA | 0.05 | 0.10 | 0.05 | 0.05 | 0.05 | 0.05 |
| 28 | 3291 | NA | 0.09 | 0.06 | 0.03 | 0.06 | NA | 0.03 | NA | NA | 0.15 | 0.06 | NA | 0.03 | 0.06 | 0.12 | NA | 0.03 |
| 29 | 2272 | 0.04 | 0.04 | 0.04 | 0.04 | NA | NA | NA | 0.13 | NA | NA | 0.09 | NA | 0.09 | 0.04 | 0.09 | NA | 0.04 |
| 30 | 3383 | 0.03 | 0.06 | NA | 0.15 | NA | NA | 0.03 | 0.09 | 0.03 | 0.03 | NA | 0.06 | 0.06 | 0.03 | 0.09 | 0.03 | 0.06 |
| 31 | 2772 | 0.07 | 0.14 | 0.04 | 0.04 | 0.04 | 0.14 | NA | 0.07 | NA | NA | 0.04 | 0.04 | 0.04 | 0.07 | NA | NA | NA |
| 32 | 1274 | 0.08 | NA | NA | NA | NA | 0.08 | NA | NA | NA | 0.08 | NA | 0.08 | 0.08 | NA | 0.08 | NA | NA |
| 33 | 3255 | NA | NA | NA | NA | NA | 0.03 | 0.03 | NA | 0.03 | NA | 0.06 | 0.06 | 0.12 | 0.06 | 0.12 | NA | 0.06 |
| 34 | 1094 | 0.09 | 0.18 | NA | 0.09 | 0.09 | NA | NA | NA | 0.09 | NA | NA | 0.18 | 0.09 | NA | NA | NA | NA |
Generate null distribution to determine if some groups exhibit more/less than the expected number of loci with exclusive polymorphisms for sets of individuals.
Chromosomes with significantly higher/lower than expected(outside simulated null distribution) number of private polymorhisms foreach run/tributary group
| GRP | CHROMOSOME | total | n | percent | min | max | sign |
|---|---|---|---|---|---|---|---|
| F_COL | 2 | 4869 | 1 | 0.0205 | 0.0205 | 0.1438 | lower |
| F_COL | 4 | 5657 | 1 | 0.0177 | 0.0177 | 0.1414 | lower |
| F_COL | 10 | 5351 | 1 | 0.0187 | 0.0187 | 0.1308 | lower |
| F_COL | 11 | 3095 | 1 | 0.0323 | 0.0323 | 0.1939 | lower |
| F_COL | 12 | 5620 | 1 | 0.0178 | 0.0178 | 0.1246 | lower |
| F_COL | 15 | 3245 | 1 | 0.0308 | 0.0308 | 0.1849 | lower |
| F_COL | 21 | 2937 | 1 | 0.0340 | 0.0340 | 0.2043 | lower |
| F_COL | 23 | 1756 | 1 | 0.0569 | 0.0569 | 0.2847 | lower |
| F_COL | 26 | 4047 | 1 | 0.0247 | 0.0247 | 0.1483 | lower |
| F_COL | 29 | 2272 | 1 | 0.0440 | 0.0440 | 0.2201 | lower |
| F_COL | 30 | 3383 | 1 | 0.0296 | 0.0296 | 0.2069 | lower |
| F_COL | 32 | 1274 | 1 | 0.0785 | 0.0785 | 0.2355 | lower |
| F_COL | 34 | 1094 | 1 | 0.0914 | 0.0914 | 0.3656 | lower |
| F_MIL | 4 | 5657 | 1 | 0.0177 | 0.0177 | 0.1768 | lower |
| F_MIL | 8 | 6666 | 1 | 0.0150 | 0.0150 | 0.1650 | lower |
| F_MIL | 16 | 5093 | 1 | 0.0196 | 0.0196 | 0.1963 | lower |
| F_MIL | 18 | 2958 | 1 | 0.0338 | 0.0338 | 0.2028 | lower |
| F_MIL | 20 | 3797 | 1 | 0.0263 | 0.0263 | 0.2107 | lower |
| F_MIL | 21 | 2937 | 1 | 0.0340 | 0.0340 | 0.2383 | lower |
| F_MIL | 22 | 3447 | 1 | 0.0290 | 0.0290 | 0.2611 | lower |
| F_MIL | 23 | 1756 | 1 | 0.0569 | 0.0569 | 0.3417 | lower |
| F_MIL | 24 | 2314 | 1 | 0.0432 | 0.0432 | 0.2593 | lower |
| F_MIL | 29 | 2272 | 1 | 0.0440 | 0.0440 | 0.2641 | lower |
| F_DER | 3 | 5253 | 1 | 0.0190 | 0.0190 | 0.1333 | lower |
| F_DER | 5 | 6268 | 1 | 0.0160 | 0.0160 | 0.1117 | lower |
| F_DER | 6 | 6408 | 1 | 0.0156 | 0.0156 | 0.1248 | lower |
| F_DER | 7 | 5707 | 1 | 0.0175 | 0.0175 | 0.1227 | lower |
| F_DER | 8 | 6666 | 1 | 0.0150 | 0.0150 | 0.1350 | lower |
| F_DER | 12 | 5620 | 1 | 0.0178 | 0.0178 | 0.1423 | lower |
| F_DER | 14 | 4284 | 1 | 0.0233 | 0.0233 | 0.1867 | lower |
| F_DER | 15 | 3245 | 1 | 0.0308 | 0.0308 | 0.1849 | lower |
| F_DER | 21 | 2937 | 1 | 0.0340 | 0.0340 | 0.1702 | lower |
| F_DER | 23 | 1756 | 1 | 0.0569 | 0.0569 | 0.2847 | lower |
| F_DER | 29 | 2272 | 1 | 0.0440 | 0.0440 | 0.1761 | lower |
| F_DER | 31 | 2772 | 1 | 0.0361 | 0.0361 | 0.1804 | lower |
| F_BUT | 2 | 4869 | 1 | 0.0205 | 0.0205 | 0.1232 | lower |
| F_BUT | 6 | 6408 | 1 | 0.0156 | 0.0156 | 0.1092 | lower |
| F_BUT | 15 | 3245 | 1 | 0.0308 | 0.0308 | 0.1541 | lower |
| F_BUT | 21 | 2937 | 1 | 0.0340 | 0.0340 | 0.1702 | lower |
| F_BUT | 22 | 3447 | 1 | 0.0290 | 0.0290 | 0.1451 | lower |
| F_BUT | 24 | 2314 | 1 | 0.0432 | 0.0432 | 0.2161 | lower |
| F_BUT | 26 | 4047 | 1 | 0.0247 | 0.0247 | 0.1235 | lower |
| F_BUT | 28 | 3291 | 1 | 0.0304 | 0.0304 | 0.1519 | lower |
| F_BUT | 29 | 2272 | 1 | 0.0440 | 0.0440 | 0.2201 | lower |
| F_BUT | 30 | 3383 | 5 | 0.1478 | 0.0296 | 0.1478 | higher |
| F_BUT | 31 | 2772 | 1 | 0.0361 | 0.0361 | 0.1804 | lower |
| F_BUT | 34 | 1094 | 1 | 0.0914 | 0.0914 | 0.3656 | lower |
| F_FRH | 3 | 5253 | 1 | 0.0190 | 0.0190 | 0.1333 | lower |
| F_FRH | 4 | 5657 | 1 | 0.0177 | 0.0177 | 0.1237 | lower |
| F_FRH | 11 | 3095 | 1 | 0.0323 | 0.0323 | 0.1939 | lower |
| F_FRH | 12 | 5620 | 1 | 0.0178 | 0.0178 | 0.1068 | lower |
| F_FRH | 13 | 5979 | 1 | 0.0167 | 0.0167 | 0.1338 | lower |
| F_FRH | 17 | 2070 | 1 | 0.0483 | 0.0483 | 0.2899 | lower |
| F_FRH | 22 | 3447 | 1 | 0.0290 | 0.0290 | 0.1451 | lower |
| F_FRH | 24 | 2314 | 1 | 0.0432 | 0.0432 | 0.2593 | lower |
| F_FRH | 25 | 2980 | 1 | 0.0336 | 0.0336 | 0.1678 | lower |
| F_FRH | 27 | 1997 | 1 | 0.0501 | 0.0501 | 0.2504 | lower |
| F_FRH | 31 | 2772 | 1 | 0.0361 | 0.0361 | 0.1443 | lower |
| F_FRH | 34 | 1094 | 1 | 0.0914 | 0.0914 | 0.2742 | lower |
| F_NIM | 2 | 4869 | 1 | 0.0205 | 0.0205 | 0.1027 | lower |
| F_NIM | 4 | 5657 | 1 | 0.0177 | 0.0177 | 0.0884 | lower |
| F_NIM | 11 | 3095 | 1 | 0.0323 | 0.0323 | 0.1616 | lower |
| F_NIM | 14 | 4284 | 1 | 0.0233 | 0.0233 | 0.1167 | lower |
| F_NIM | 15 | 3245 | 1 | 0.0308 | 0.0308 | 0.1541 | lower |
| F_NIM | 31 | 2772 | 4 | 0.1443 | 0.0361 | 0.1443 | higher |
| F_NIM | 32 | 1274 | 1 | 0.0785 | 0.0785 | 0.1570 | lower |
| F_NIM | 33 | 3255 | 1 | 0.0307 | 0.0307 | 0.1229 | lower |
| F_MKH | 3 | 5253 | 1 | 0.0190 | 0.0190 | 0.1333 | lower |
| F_MKH | 6 | 6408 | 1 | 0.0156 | 0.0156 | 0.1248 | lower |
| F_MKH | 7 | 5707 | 1 | 0.0175 | 0.0175 | 0.1051 | lower |
| F_MKH | 14 | 4284 | 1 | 0.0233 | 0.0233 | 0.1401 | lower |
| F_MKH | 21 | 2937 | 1 | 0.0340 | 0.0340 | 0.1702 | lower |
| F_MKH | 22 | 3447 | 1 | 0.0290 | 0.0290 | 0.1451 | lower |
| F_MKH | 24 | 2314 | 1 | 0.0432 | 0.0432 | 0.2161 | lower |
| F_MKH | 25 | 2980 | 1 | 0.0336 | 0.0336 | 0.1678 | lower |
| F_MKH | 27 | 1997 | 1 | 0.0501 | 0.0501 | 0.2003 | lower |
| F_MKH | 28 | 3291 | 1 | 0.0304 | 0.0304 | 0.1519 | lower |
| F_MKH | 30 | 3383 | 1 | 0.0296 | 0.0296 | 0.2069 | lower |
| F_MKH | 33 | 3255 | 1 | 0.0307 | 0.0307 | 0.1536 | lower |
| F_STN | 2 | 4869 | 1 | 0.0205 | 0.0205 | 0.1232 | lower |
| F_STN | 5 | 6268 | 1 | 0.0160 | 0.0160 | 0.1117 | lower |
| F_STN | 7 | 5707 | 1 | 0.0175 | 0.0175 | 0.1402 | lower |
| F_STN | 8 | 6666 | 1 | 0.0150 | 0.0150 | 0.1200 | lower |
| F_STN | 14 | 4284 | 1 | 0.0233 | 0.0233 | 0.1401 | lower |
| F_STN | 17 | 2070 | 1 | 0.0483 | 0.0483 | 0.2415 | lower |
| F_STN | 20 | 3797 | 1 | 0.0263 | 0.0263 | 0.1580 | lower |
| F_STN | 21 | 2937 | 1 | 0.0340 | 0.0340 | 0.2043 | lower |
| F_STN | 22 | 3447 | 1 | 0.0290 | 0.0290 | 0.1741 | lower |
| F_STN | 27 | 1997 | 1 | 0.0501 | 0.0501 | 0.2504 | lower |
| F_TOU | 3 | 5253 | 1 | 0.0190 | 0.0190 | 0.1904 | lower |
| F_TOU | 10 | 5351 | 1 | 0.0187 | 0.0187 | 0.1495 | lower |
| F_TOU | 13 | 5979 | 1 | 0.0167 | 0.0167 | 0.1505 | lower |
| F_TOU | 27 | 1997 | 1 | 0.0501 | 0.0501 | 0.2504 | lower |
| F_TOU | 30 | 3383 | 1 | 0.0296 | 0.0296 | 0.2069 | lower |
| F_TOU | 33 | 3255 | 1 | 0.0307 | 0.0307 | 0.2151 | lower |
| F_TOU | 34 | 1094 | 1 | 0.0914 | 0.0914 | 0.4570 | lower |
| F_MRH | 6 | 6408 | 1 | 0.0156 | 0.0156 | 0.1717 | lower |
| F_MRH | 10 | 5351 | 1 | 0.0187 | 0.0187 | 0.2056 | lower |
| F_MRH | 11 | 3095 | 1 | 0.0323 | 0.0323 | 0.2908 | lower |
| F_MRH | 16 | 5093 | 1 | 0.0196 | 0.0196 | 0.1963 | lower |
| F_MRH | 18 | 2958 | 1 | 0.0338 | 0.0338 | 0.2705 | lower |
| F_MRH | 20 | 3797 | 1 | 0.0263 | 0.0263 | 0.2107 | lower |
| F_MRH | 21 | 2937 | 1 | 0.0340 | 0.0340 | 0.2724 | lower |
| F_MRH | 23 | 1756 | 1 | 0.0569 | 0.0569 | 0.2847 | lower |
| F_MRH | 24 | 2314 | 1 | 0.0432 | 0.0432 | 0.3025 | lower |
| F_MRH | 30 | 3383 | 1 | 0.0296 | 0.0296 | 0.2069 | lower |
| F_MRH | 32 | 1274 | 1 | 0.0785 | 0.0785 | 0.3925 | lower |
| F_MER | 6 | 6408 | 1 | 0.0156 | 0.0156 | 0.1092 | lower |
| F_MER | 10 | 5351 | 1 | 0.0187 | 0.0187 | 0.1308 | lower |
| F_MER | 12 | 5620 | 1 | 0.0178 | 0.0178 | 0.1068 | lower |
| F_MER | 14 | 4284 | 1 | 0.0233 | 0.0233 | 0.1167 | lower |
| F_MER | 16 | 5093 | 1 | 0.0196 | 0.0196 | 0.1571 | lower |
| F_MER | 21 | 2937 | 1 | 0.0340 | 0.0340 | 0.1702 | lower |
| F_MER | 22 | 3447 | 1 | 0.0290 | 0.0290 | 0.1741 | lower |
| F_MER | 24 | 2314 | 1 | 0.0432 | 0.0432 | 0.1729 | lower |
| F_MER | 26 | 4047 | 1 | 0.0247 | 0.0247 | 0.1730 | lower |
| F_MER | 31 | 2772 | 1 | 0.0361 | 0.0361 | 0.1443 | lower |
| L_USR | 9 | 6000 | 1 | 0.0167 | 0.0167 | 0.1333 | lower |
| L_USR | 12 | 5620 | 1 | 0.0178 | 0.0178 | 0.1246 | lower |
| L_USR | 16 | 5093 | 1 | 0.0196 | 0.0196 | 0.1571 | lower |
| L_USR | 18 | 2958 | 1 | 0.0338 | 0.0338 | 0.2366 | lower |
| L_USR | 20 | 3797 | 1 | 0.0263 | 0.0263 | 0.2634 | lower |
| L_USR | 25 | 2980 | 1 | 0.0336 | 0.0336 | 0.2349 | lower |
| L_USR | 27 | 1997 | 1 | 0.0501 | 0.0501 | 0.2504 | lower |
| L_USR | 31 | 2772 | 1 | 0.0361 | 0.0361 | 0.2165 | lower |
| L_USR | 32 | 1274 | 1 | 0.0785 | 0.0785 | 0.3925 | lower |
| W_USR | 8 | 6666 | 1 | 0.0150 | 0.0150 | 0.2250 | lower |
| W_USR | 21 | 2937 | 1 | 0.0340 | 0.0340 | 0.3064 | lower |
| W_USR | 22 | 3447 | 1 | 0.0290 | 0.0290 | 0.2901 | lower |
| W_USR | 24 | 2314 | 1 | 0.0432 | 0.0432 | 0.3889 | lower |
| W_USR | 26 | 4047 | 11 | 0.2718 | 0.0247 | 0.2471 | higher |
| W_USR | 28 | 3291 | 1 | 0.0304 | 0.0304 | 0.3039 | lower |
| W_USR | 31 | 2772 | 1 | 0.0361 | 0.0361 | 0.3247 | lower |
| W_USR | 32 | 1274 | 1 | 0.0785 | 0.0785 | 0.4710 | lower |
| W_USR | 34 | 1094 | 1 | 0.0914 | 0.0914 | 0.6399 | lower |
| S_MIL | 4 | 5657 | 1 | 0.0177 | 0.0177 | 0.1237 | lower |
| S_MIL | 11 | 3095 | 1 | 0.0323 | 0.0323 | 0.1939 | lower |
| S_MIL | 19 | 5071 | 1 | 0.0197 | 0.0197 | 0.1775 | lower |
| S_MIL | 21 | 2937 | 1 | 0.0340 | 0.0340 | 0.2043 | lower |
| S_MIL | 27 | 1997 | 1 | 0.0501 | 0.0501 | 0.2504 | lower |
| S_MIL | 29 | 2272 | 1 | 0.0440 | 0.0440 | 0.2201 | lower |
| S_MIL | 30 | 3383 | 1 | 0.0296 | 0.0296 | 0.1774 | lower |
| S_DER | 2 | 4869 | 1 | 0.0205 | 0.0205 | 0.2054 | lower |
| S_DER | 14 | 4284 | 1 | 0.0233 | 0.0233 | 0.2568 | lower |
| S_DER | 15 | 3245 | 1 | 0.0308 | 0.0308 | 0.2465 | lower |
| S_DER | 22 | 3447 | 1 | 0.0290 | 0.0290 | 0.2031 | lower |
| S_DER | 27 | 1997 | 1 | 0.0501 | 0.0501 | 0.2504 | lower |
| S_DER | 32 | 1274 | 1 | 0.0785 | 0.0785 | 0.3925 | lower |
| S_BUT | 5 | 6268 | 1 | 0.0160 | 0.0160 | 0.0638 | lower |
| S_BUT | 6 | 6408 | 1 | 0.0156 | 0.0156 | 0.0624 | lower |
| S_BUT | 11 | 3095 | 1 | 0.0323 | 0.0323 | 0.0969 | lower |
| S_BUT | 13 | 5979 | 1 | 0.0167 | 0.0167 | 0.0669 | lower |
| S_BUT | 15 | 3245 | 1 | 0.0308 | 0.0308 | 0.0924 | lower |
| S_BUT | 16 | 5093 | 1 | 0.0196 | 0.0196 | 0.0785 | lower |
| S_BUT | 17 | 2070 | 1 | 0.0483 | 0.0483 | 0.1449 | lower |
| S_BUT | 27 | 1997 | 1 | 0.0501 | 0.0501 | 0.1502 | lower |
| S_BUT | 30 | 3383 | 1 | 0.0296 | 0.0296 | 0.0887 | lower |
| S_FRH | 2 | 4869 | 1 | 0.0205 | 0.0205 | 0.1643 | lower |
| S_FRH | 11 | 3095 | 1 | 0.0323 | 0.0323 | 0.2585 | lower |
| S_FRH | 15 | 3245 | 1 | 0.0308 | 0.0308 | 0.2157 | lower |
| S_FRH | 18 | 2958 | 1 | 0.0338 | 0.0338 | 0.2028 | lower |
| S_FRH | 26 | 4047 | 1 | 0.0247 | 0.0247 | 0.1730 | lower |
| S_FRH | 27 | 1997 | 1 | 0.0501 | 0.0501 | 0.2504 | lower |
| S_FRH | 28 | 3291 | 1 | 0.0304 | 0.0304 | 0.2127 | lower |
| S_FRH | 29 | 2272 | 1 | 0.0440 | 0.0440 | 0.2641 | lower |
Table S15: Comparison across run/tributary group of the mean+/- sd, minimum, and maximum percent of loci on a chromosome that areonly polymorphic for a set of individuals from the same tributary of thesame run type.
| POLYMORPHIC | mean | sd | min | max |
|---|---|---|---|---|
| F_COL | 0.05 | 0.03 | 0.02 | 0.12 |
| F_MIL | 0.07 | 0.04 | 0.02 | 0.18 |
| F_DER | 0.05 | 0.03 | 0.02 | 0.14 |
| F_BUT | 0.04 | 0.03 | 0.02 | 0.15 |
| F_FRH | 0.05 | 0.02 | 0.02 | 0.12 |
| F_NIM | 0.04 | 0.03 | 0.02 | 0.14 |
| F_MKH | 0.04 | 0.02 | 0.02 | 0.10 |
| F_STN | 0.05 | 0.03 | 0.02 | 0.13 |
| F_TOU | 0.06 | 0.03 | 0.02 | 0.14 |
| F_MRH | 0.07 | 0.04 | 0.02 | 0.15 |
| F_MER | 0.05 | 0.02 | 0.02 | 0.10 |
| L_USR | 0.06 | 0.05 | 0.02 | 0.18 |
| W_USR | 0.10 | 0.05 | 0.02 | 0.27 |
| S_MIL | 0.05 | 0.02 | 0.02 | 0.09 |
| S_DER | 0.07 | 0.03 | 0.02 | 0.12 |
| S_BUT | 0.03 | 0.01 | 0.02 | 0.05 |
| S_FRH | 0.06 | 0.03 | 0.02 | 0.11 |
Table S16: Number of chromosomes with significantly more/lessloci with private polymorphisms than expected.
| GRP | sign | n |
|---|---|---|
| F_COL | lower | 13 |
| F_DER | lower | 12 |
| F_FRH | lower | 12 |
| F_MKH | lower | 12 |
| F_BUT | lower | 11 |
| F_MRH | lower | 11 |
| F_MIL | lower | 10 |
| F_STN | lower | 10 |
| F_MER | lower | 10 |
| L_USR | lower | 9 |
| S_BUT | lower | 9 |
| W_USR | lower | 8 |
| S_FRH | lower | 8 |
| F_NIM | lower | 7 |
| F_TOU | lower | 7 |
| S_MIL | lower | 7 |
| S_DER | lower | 6 |
| F_BUT | higher | 1 |
| F_NIM | higher | 1 |
| W_USR | higher | 1 |
Table S17: Number of run/trib groups a chromosome has beenflagged as having significantly more/less loci with privatepolymorphisms than expected.
| CHROMOSOME | sign | n |
|---|---|---|
| 21 | lower | 10 |
| 27 | lower | 9 |
| 22 | lower | 8 |
| 11 | lower | 7 |
| 15 | lower | 7 |
| 24 | lower | 7 |
| 2 | lower | 6 |
| 6 | lower | 6 |
| 14 | lower | 6 |
| 29 | lower | 6 |
| 30 | lower | 6 |
| 31 | lower | 6 |
| 32 | lower | 6 |
| 4 | lower | 5 |
| 12 | lower | 5 |
| 16 | lower | 5 |
| 34 | lower | 5 |
| 3 | lower | 4 |
| 8 | lower | 4 |
| 10 | lower | 4 |
| 18 | lower | 4 |
| 20 | lower | 4 |
| 23 | lower | 4 |
| 26 | lower | 4 |
| 28 | lower | 4 |
| 5 | lower | 3 |
| 7 | lower | 3 |
| 13 | lower | 3 |
| 17 | lower | 3 |
| 25 | lower | 3 |
| 33 | lower | 3 |
| 9 | lower | 1 |
| 19 | lower | 1 |
| 26 | higher | 1 |
| 30 | higher | 1 |
| 31 | higher | 1 |
Identify loci only variable among individuals from the same run type (private polymorphisms)
For this analysis we will exclude hatchery individuals and combine wild individuals from each run and the subset to 21 individuals (number of individuals in Late Fall run data set) to identify the number of alleles that occur only in individuals of a given run.
# wild populations
wild <- strata %>%
filter(SOURCE == "WILD")
# subset genind object
gen_wild <- gen[row.names(gen@tab) %in% wild$LIB_ID]
# by tributary within run
setPop(gen_wild) <- ~RUN
# calculate allele counts using rarefaction ----
dat <- genind2hierfstat(gen_wild)
df <- allelic.richness(dat,
diploid = TRUE)
df <- as.data.frame(df$Ar) %>%
rownames_to_column("LOCUS")
write_delim(df, "results/run_rarefied.allelecount", delim = "\t")
# identify private polymorphisms ----
tidy_ar <-df %>%
gather(key = GRP, AR, 2:5) %>%
mutate(GRP = ordered(GRP, levels = run_levels)) %>%
rename(N_ALLELES = AR)
pops <- unique(tidy_ar$GRP)
# df for results
results <- setNames(data.frame(matrix(ncol = 2, nrow = 0)),
c("LOCUS", "POLYMORPHIC")) %>%
mutate(LOCUS = as.character(LOCUS),
POLYMORPHIC = as.character(POLYMORPHIC))
# loop over to identify loci polymorphic in one pop but fixed in all others
for(p in pops){
# find loci polymorphic in pop
polymorphic <- tidy_ar %>%
filter(N_ALLELES > 1 & GRP == p)
# find loci fixed in all other pops
fixed <- tidy_ar %>%
filter(!GRP == p) %>%
group_by(LOCUS) %>%
count(N_ALLELES == 1) %>%
filter(`N_ALLELES == 1` == TRUE) %>%
filter(n == length(pops)-1) %>%
mutate(POLYMORPHIC = p) %>%
select(LOCUS, POLYMORPHIC)
results <- bind_rows(results, fixed)
}
pp <- results %>%
count(LOCUS) %>%
filter(n == 1)
results <- results %>%
filter(LOCUS %in% pp$LOCUS)
Determine differences in number of loci with private polymorphisms among run/tributary groups.
Number of loci polymorphic in individuals of the same run typethat are not variable in individuals of any other run based on rarefiedallele counts.
| population | n |
|---|---|
| Fall | 1348 |
| Late-Fall | 122 |
| Spring | 580 |
| Winter | 95 |
Private alleles are alleles that only occur within a given set of individuals; this can be a locus that is only variable in one set of individuals (all other individuals are fixed for the alternate allele(s)) or it could be a locus with multiple alleles, that are variable in other sets of individuals (i.e. heterozygotes & homozygotes occur) but one (or more) of those alleles only exists within a given set of individuals. Not all individuals of the group will have the private allele, i.e. they may be more or less common.
Comparison of private alleles by tributary within run type
There is a relationship between sample size and the number of private alleles found making it different to compare levels overall (many private alleles are rare so for smaller sample sizes they might not be recovered); though it is important to note that all run/tributary groups do exhibit private alleles, i.e. harbor unique genetic material.
Comparison of sample size and number of private alleles detected in each trib/run population.
To account for this, we subsampled down to 15 individuals 100 times to determine the number of private alleles occurring in each subset of individuals.
# unset the seed to get individually different reps
set.seed(NULL)
# list for results
priv_allele_counts <- list()
# number of individuals to subset
n <- 20
# feather river spring individuals
for(i in 1:100){
# draw up to n individuals from each population
sample_ind <- strata %>%
group_by(RUN_LOC) %>%
slice_sample(n = n)
# subset genind object
gen_subset <- gen[row.names(gen@tab) %in% sample_ind$LIB_ID]
# identify private alleles
alleles_private <- poppr::private_alleles(gen_subset,
alleles ~ RUN_LOC,
report = "data.frame")
# count no private alleles per population
priv_allele_counts[[i]] <- alleles_private %>%
filter(count > 0) %>%
count(population) %>%
mutate(rep = i)
}
results <- ldply(priv_allele_counts, data.frame)
write_delim(results, "results/priv-alleles_15ind-100reps.txt",
delim = "\t")
Calculate the mean number of private alleles across all bootstraps.
Compariso of sample size and number of private alleles detected in each trib/run population after adjusting for sampling size by bootstrapping to 15 individuals.
Number of private alleles occurring in each set of individualsgrouped by run type within tributaries. All populations except for S_FRHwere subsampled down to 15 individuals.
| population | mean | sd |
|---|---|---|
| F_COL | 140.21 | 9.706079 |
| F_MIL | 130.80 | 6.843119 |
| F_DER | 140.63 | 6.539167 |
| F_BUT | 90.48 | 7.576519 |
| F_FRH | 139.50 | 13.858958 |
| F_NIM | 123.39 | 11.653148 |
| F_MKH | 115.67 | 11.751686 |
| F_STN | 104.04 | 7.961930 |
| F_TOU | 94.75 | 8.867572 |
| F_MRH | 132.01 | 6.330135 |
| F_MER | 93.19 | 8.604315 |
| L_USR | 134.58 | 9.584478 |
| W_USR | 122.99 | 9.408244 |
| S_MIL | 193.20 | 10.078480 |
| S_DER | 203.80 | 38.640273 |
| S_BUT | 185.43 | 24.757513 |
| S_FRH | 59.46 | 4.875045 |
Populations with high standard deviations are either hatchery individuals or have large number of private polymorphisms (which set of individuals that you pull out can have a larger impact on number of private alleles depending on whether you happen to pull a set with a large number of private polymorphisms or not).
Compare the distribution of loci with private alleles across the genome.
Determine the % loci of a given chromosome that are only polymorphic for a set of individuals from the same tributary of the same run type.
| CHROMOSOME | total | F_COL | F_MIL | F_DER | F_BUT | F_FRH | F_NIM | F_MKH | F_STN | F_TOU | F_MRH | F_MER | L_USR | W_USR | S_MIL | S_DER | S_BUT | S_FRH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 8167 | 0.12 | 0.09 | 0.09 | 0.05 | 0.16 | 0.23 | 0.12 | 0.06 | 0.02 | 0.06 | 0.07 | 0.04 | 0.07 | 0.10 | 0.17 | 0.17 | 0.04 |
| 2 | 4869 | 0.08 | 0.06 | 0.02 | 0.04 | 0.10 | 0.12 | 0.06 | NA | 0.08 | 0.06 | 0.06 | 0.02 | NA | 0.02 | 0.14 | 0.12 | NA |
| 3 | 5253 | 0.23 | 0.08 | 0.04 | NA | 0.08 | 0.10 | 0.11 | 0.06 | 0.08 | 0.04 | 0.08 | 0.08 | 0.06 | 0.13 | 0.19 | 0.17 | NA |
| 4 | 5657 | 0.11 | 0.05 | 0.05 | 0.05 | 0.04 | 0.09 | 0.02 | 0.05 | 0.09 | 0.09 | 0.14 | 0.14 | 0.16 | 0.05 | 0.19 | 0.09 | 0.02 |
| 5 | 6268 | 0.22 | 0.02 | 0.05 | 0.03 | 0.11 | 0.11 | 0.05 | 0.11 | 0.02 | 0.02 | 0.03 | 0.10 | 0.16 | 0.06 | 0.18 | 0.14 | 0.02 |
| 6 | 6408 | 0.05 | 0.05 | 0.06 | 0.06 | 0.11 | 0.14 | 0.16 | 0.05 | 0.03 | 0.02 | 0.09 | 0.06 | 0.08 | 0.11 | 0.17 | 0.14 | 0.03 |
| 7 | 5707 | 0.14 | 0.09 | 0.02 | NA | 0.07 | 0.07 | 0.07 | 0.09 | 0.04 | 0.04 | 0.04 | 0.09 | 0.11 | 0.09 | 0.25 | 0.11 | 0.07 |
| 8 | 6666 | 0.09 | 0.09 | 0.05 | 0.08 | 0.09 | 0.11 | 0.18 | 0.11 | 0.09 | 0.03 | 0.06 | 0.03 | 0.14 | 0.12 | 0.11 | 0.11 | 0.06 |
| 9 | 6000 | 0.23 | 0.07 | 0.05 | 0.08 | 0.15 | 0.05 | 0.05 | 0.12 | 0.07 | 0.03 | 0.08 | 0.07 | 0.03 | 0.13 | 0.17 | 0.08 | NA |
| 10 | 5351 | 0.09 | 0.04 | 0.04 | 0.07 | 0.09 | 0.06 | 0.11 | 0.02 | 0.06 | 0.07 | 0.02 | 0.15 | 0.13 | 0.06 | 0.19 | 0.09 | 0.04 |
| 11 | 3095 | 0.03 | 0.03 | 0.03 | 0.13 | 0.06 | 0.13 | 0.13 | 0.06 | 0.16 | 0.13 | 0.03 | 0.03 | 0.03 | 0.03 | 0.13 | 0.10 | 0.10 |
| 12 | 5620 | 0.12 | 0.04 | 0.07 | 0.02 | 0.12 | 0.12 | 0.07 | 0.05 | 0.07 | 0.07 | 0.09 | 0.04 | 0.11 | 0.09 | 0.14 | 0.14 | 0.02 |
| 13 | 5979 | 0.13 | 0.03 | NA | NA | 0.12 | 0.05 | 0.07 | 0.07 | 0.03 | NA | 0.08 | 0.08 | 0.03 | 0.12 | 0.12 | 0.05 | 0.02 |
| 14 | 4284 | 0.16 | 0.09 | NA | 0.05 | 0.14 | 0.07 | 0.16 | 0.05 | 0.09 | 0.05 | 0.05 | 0.07 | 0.07 | 0.05 | 0.16 | 0.05 | NA |
| 15 | 3245 | 0.06 | NA | 0.03 | 0.03 | 0.06 | 0.06 | 0.18 | 0.09 | 0.12 | 0.06 | 0.06 | 0.06 | NA | NA | 0.25 | 0.06 | 0.06 |
| 16 | 5093 | 0.12 | 0.08 | 0.14 | NA | 0.06 | 0.18 | 0.08 | 0.02 | 0.16 | 0.04 | 0.04 | 0.10 | 0.04 | NA | 0.12 | 0.08 | 0.02 |
| 17 | 2070 | 0.10 | 0.05 | NA | NA | NA | 0.14 | NA | NA | 0.10 | NA | 0.05 | 0.05 | NA | 0.14 | 0.14 | 0.14 | 0.05 |
| 18 | 2958 | 0.10 | NA | 0.14 | 0.07 | 0.14 | NA | NA | 0.07 | 0.07 | 0.17 | 0.03 | 0.14 | 0.07 | 0.14 | 0.14 | 0.20 | NA |
| 19 | 5071 | 0.10 | 0.20 | 0.08 | 0.08 | 0.10 | 0.20 | 0.12 | 0.04 | 0.02 | 0.14 | 0.10 | 0.08 | 0.12 | 0.04 | 0.16 | 0.06 | NA |
| 20 | 3797 | 0.08 | 0.08 | 0.05 | 0.03 | 0.03 | NA | 0.03 | 0.03 | 0.11 | 0.13 | 0.08 | 0.13 | 0.05 | 0.05 | 0.26 | 0.08 | NA |
| 21 | 2937 | 0.10 | 0.07 | 0.07 | 0.03 | 0.10 | 0.10 | 0.03 | 0.10 | 0.10 | 0.07 | 0.14 | 0.10 | 0.17 | 0.10 | 0.07 | 0.03 | NA |
| 22 | 3447 | 0.15 | 0.06 | 0.03 | 0.03 | 0.12 | 0.03 | 0.03 | 0.06 | 0.06 | 0.03 | 0.03 | 0.15 | 0.17 | 0.12 | 0.15 | 0.09 | NA |
| 23 | 1756 | 0.06 | NA | 0.06 | NA | 0.17 | 0.23 | 0.11 | 0.11 | NA | 0.11 | 0.06 | 0.11 | 0.23 | NA | 0.23 | 0.06 | NA |
| 24 | 2314 | 0.26 | 0.04 | 0.13 | 0.04 | 0.09 | 0.17 | 0.09 | 0.09 | 0.09 | NA | 0.13 | 0.13 | 0.04 | 0.09 | 0.09 | 0.22 | NA |
| 25 | 2980 | NA | 0.17 | 0.03 | NA | 0.07 | 0.10 | 0.13 | NA | 0.10 | 0.03 | 0.20 | 0.13 | 0.03 | 0.17 | 0.23 | 0.13 | NA |
| 26 | 4047 | 0.17 | 0.07 | 0.10 | 0.05 | 0.07 | 0.15 | 0.07 | 0.05 | 0.07 | 0.10 | 0.05 | 0.05 | 0.07 | 0.07 | 0.35 | 0.15 | NA |
| 27 | 1997 | 0.05 | 0.10 | NA | NA | 0.15 | 0.05 | 0.10 | 0.05 | 0.10 | 0.10 | 0.10 | 0.15 | 0.05 | 0.15 | 0.15 | 0.20 | 0.05 |
| 28 | 3291 | 0.21 | 0.12 | 0.03 | NA | 0.18 | NA | 0.03 | 0.06 | 0.03 | 0.06 | 0.06 | 0.03 | NA | 0.12 | 0.12 | 0.21 | NA |
| 29 | 2272 | NA | 0.04 | 0.04 | NA | 0.26 | 0.09 | 0.04 | 0.13 | 0.04 | 0.13 | 0.18 | 0.13 | 0.04 | 0.09 | 0.13 | 0.09 | NA |
| 30 | 3383 | 0.15 | 0.12 | 0.18 | NA | 0.06 | 0.09 | 0.06 | NA | 0.09 | NA | 0.09 | 0.09 | 0.06 | 0.06 | 0.09 | 0.12 | 0.06 |
| 31 | 2772 | NA | 0.04 | 0.04 | 0.18 | 0.22 | 0.04 | 0.22 | 0.04 | NA | 0.07 | 0.11 | 0.04 | 0.07 | 0.11 | 0.14 | 0.04 | NA |
| 32 | 1274 | 0.08 | NA | NA | 0.08 | NA | NA | 0.08 | NA | 0.08 | NA | NA | NA | 0.08 | 0.31 | 0.24 | 0.08 | 0.08 |
| 33 | 3255 | 0.06 | 0.03 | NA | 0.06 | NA | 0.09 | 0.09 | 0.06 | 0.06 | 0.06 | 0.03 | 0.09 | 0.06 | 0.22 | 0.40 | 0.22 | 0.03 |
| 34 | 1094 | NA | 0.09 | 0.09 | NA | 0.18 | 0.09 | 0.18 | NA | NA | NA | NA | NA | 0.18 | NA | 0.09 | 0.09 | NA |
Compare the mean proportion of loci per chromosome that are private polymorphisms across tributaries.
Table S18: Comparison across run/tributary group of the mean+/- sd, minimum, and maximum percent of loci on a chromosome that areonly polymorphic for a set of individuals from the same tributary of thesame run type.
| population | mean | sd | min | max |
|---|---|---|---|---|
| F_COL | 0.12 | 0.06 | 0.03 | 0.26 |
| F_MIL | 0.07 | 0.04 | 0.02 | 0.20 |
| F_DER | 0.06 | 0.04 | 0.02 | 0.18 |
| F_BUT | 0.06 | 0.04 | 0.02 | 0.18 |
| F_FRH | 0.11 | 0.05 | 0.03 | 0.26 |
| F_NIM | 0.11 | 0.05 | 0.03 | 0.23 |
| F_MKH | 0.10 | 0.05 | 0.02 | 0.22 |
| F_STN | 0.07 | 0.03 | 0.02 | 0.13 |
| F_TOU | 0.07 | 0.04 | 0.02 | 0.16 |
| F_MRH | 0.07 | 0.04 | 0.02 | 0.17 |
| F_MER | 0.08 | 0.04 | 0.02 | 0.20 |
| L_USR | 0.09 | 0.04 | 0.02 | 0.15 |
| W_USR | 0.09 | 0.05 | 0.03 | 0.23 |
| S_MIL | 0.10 | 0.06 | 0.02 | 0.31 |
| S_DER | 0.17 | 0.07 | 0.07 | 0.40 |
| S_BUT | 0.11 | 0.05 | 0.03 | 0.22 |
| S_FRH | 0.04 | 0.02 | 0.02 | 0.10 |
Generate a random null distribution to determine if loci with private alleles are non-randomly distributed across chromosomes.
Table S19: Chromosomes with significantly higher/lower thanexpected (outside simulated null distribution) number of fixed loci foreach run/tributary group
| GRP | CHROMOSOME | total | n | percent | min | max | sign |
|---|---|---|---|---|---|---|---|
| F_MIL | 34 | 1094 | 1 | 0.0914 | 0.0914 | 0.3656 | lower |
| F_DER | 34 | 1094 | 1 | 0.0914 | 0.0914 | 0.3656 | lower |
| F_NIM | 34 | 1094 | 1 | 0.0914 | 0.0914 | 0.4570 | lower |
| S_DER | 34 | 1094 | 1 | 0.0914 | 0.0914 | 0.4570 | lower |
| S_BUT | 34 | 1094 | 1 | 0.0914 | 0.0914 | 0.3656 | lower |
| F_MIL | 33 | 3255 | 1 | 0.0307 | 0.0307 | 0.2458 | lower |
| F_MER | 33 | 3255 | 1 | 0.0307 | 0.0307 | 0.2458 | lower |
| S_FRH | 33 | 3255 | 1 | 0.0307 | 0.0307 | 0.1229 | lower |
| F_COL | 32 | 1274 | 1 | 0.0785 | 0.0785 | 0.4710 | lower |
| F_BUT | 32 | 1274 | 1 | 0.0785 | 0.0785 | 0.3140 | lower |
| F_MKH | 32 | 1274 | 1 | 0.0785 | 0.0785 | 0.3140 | lower |
| F_TOU | 32 | 1274 | 1 | 0.0785 | 0.0785 | 0.3140 | lower |
| W_USR | 32 | 1274 | 1 | 0.0785 | 0.0785 | 0.3140 | lower |
| S_BUT | 32 | 1274 | 1 | 0.0785 | 0.0785 | 0.4710 | lower |
| S_FRH | 32 | 1274 | 1 | 0.0785 | 0.0785 | 0.2355 | lower |
| F_MIL | 31 | 2772 | 1 | 0.0361 | 0.0361 | 0.2886 | lower |
| F_DER | 31 | 2772 | 1 | 0.0361 | 0.0361 | 0.2165 | lower |
| F_BUT | 31 | 2772 | 5 | 0.1804 | 0.0361 | 0.1804 | higher |
| F_NIM | 31 | 2772 | 1 | 0.0361 | 0.0361 | 0.2886 | lower |
| F_STN | 31 | 2772 | 1 | 0.0361 | 0.0361 | 0.2165 | lower |
| L_USR | 31 | 2772 | 1 | 0.0361 | 0.0361 | 0.2886 | lower |
| S_BUT | 31 | 2772 | 1 | 0.0361 | 0.0361 | 0.3247 | lower |
| F_DER | 30 | 3383 | 6 | 0.1774 | 0.0296 | 0.1774 | higher |
| F_MIL | 29 | 2272 | 1 | 0.0440 | 0.0440 | 0.3081 | lower |
| F_DER | 29 | 2272 | 1 | 0.0440 | 0.0440 | 0.3081 | lower |
| F_MKH | 29 | 2272 | 1 | 0.0440 | 0.0440 | 0.3961 | lower |
| F_TOU | 29 | 2272 | 1 | 0.0440 | 0.0440 | 0.2641 | lower |
| W_USR | 29 | 2272 | 1 | 0.0440 | 0.0440 | 0.3521 | lower |
| F_DER | 28 | 3291 | 1 | 0.0304 | 0.0304 | 0.2127 | lower |
| F_MKH | 28 | 3291 | 1 | 0.0304 | 0.0304 | 0.2735 | lower |
| F_TOU | 28 | 3291 | 1 | 0.0304 | 0.0304 | 0.2127 | lower |
| L_USR | 28 | 3291 | 1 | 0.0304 | 0.0304 | 0.2127 | lower |
| F_COL | 27 | 1997 | 1 | 0.0501 | 0.0501 | 0.4507 | lower |
| F_NIM | 27 | 1997 | 1 | 0.0501 | 0.0501 | 0.5008 | lower |
| F_STN | 27 | 1997 | 1 | 0.0501 | 0.0501 | 0.3005 | lower |
| W_USR | 27 | 1997 | 1 | 0.0501 | 0.0501 | 0.4006 | lower |
| S_FRH | 27 | 1997 | 1 | 0.0501 | 0.0501 | 0.2504 | lower |
| F_DER | 25 | 2980 | 1 | 0.0336 | 0.0336 | 0.2349 | lower |
| F_MRH | 25 | 2980 | 1 | 0.0336 | 0.0336 | 0.2685 | lower |
| W_USR | 25 | 2980 | 1 | 0.0336 | 0.0336 | 0.2685 | lower |
| F_MIL | 24 | 2314 | 1 | 0.0432 | 0.0432 | 0.3025 | lower |
| F_BUT | 24 | 2314 | 1 | 0.0432 | 0.0432 | 0.2593 | lower |
| W_USR | 24 | 2314 | 1 | 0.0432 | 0.0432 | 0.4754 | lower |
| F_COL | 23 | 1756 | 1 | 0.0569 | 0.0569 | 0.5125 | lower |
| F_DER | 23 | 1756 | 1 | 0.0569 | 0.0569 | 0.2847 | lower |
| F_MER | 23 | 1756 | 1 | 0.0569 | 0.0569 | 0.3417 | lower |
| S_BUT | 23 | 1756 | 1 | 0.0569 | 0.0569 | 0.3986 | lower |
| F_DER | 22 | 3447 | 1 | 0.0290 | 0.0290 | 0.2031 | lower |
| F_BUT | 22 | 3447 | 1 | 0.0290 | 0.0290 | 0.2031 | lower |
| F_NIM | 22 | 3447 | 1 | 0.0290 | 0.0290 | 0.2901 | lower |
| F_MKH | 22 | 3447 | 1 | 0.0290 | 0.0290 | 0.2321 | lower |
| F_MRH | 22 | 3447 | 1 | 0.0290 | 0.0290 | 0.2031 | lower |
| F_MER | 22 | 3447 | 1 | 0.0290 | 0.0290 | 0.2031 | lower |
| F_BUT | 21 | 2937 | 1 | 0.0340 | 0.0340 | 0.2043 | lower |
| F_MKH | 21 | 2937 | 1 | 0.0340 | 0.0340 | 0.3745 | lower |
| S_BUT | 21 | 2937 | 1 | 0.0340 | 0.0340 | 0.3405 | lower |
| F_BUT | 20 | 3797 | 1 | 0.0263 | 0.0263 | 0.1580 | lower |
| F_FRH | 20 | 3797 | 1 | 0.0263 | 0.0263 | 0.2370 | lower |
| F_MKH | 20 | 3797 | 1 | 0.0263 | 0.0263 | 0.2634 | lower |
| F_STN | 20 | 3797 | 1 | 0.0263 | 0.0263 | 0.1844 | lower |
| F_TOU | 19 | 5071 | 1 | 0.0197 | 0.0197 | 0.1972 | lower |
| F_MER | 18 | 2958 | 1 | 0.0338 | 0.0338 | 0.3043 | lower |
| F_MIL | 17 | 2070 | 1 | 0.0483 | 0.0483 | 0.2415 | lower |
| F_MER | 17 | 2070 | 1 | 0.0483 | 0.0483 | 0.2899 | lower |
| L_USR | 17 | 2070 | 1 | 0.0483 | 0.0483 | 0.2415 | lower |
| S_FRH | 17 | 2070 | 1 | 0.0483 | 0.0483 | 0.1932 | lower |
| F_STN | 16 | 5093 | 1 | 0.0196 | 0.0196 | 0.1571 | lower |
| S_FRH | 16 | 5093 | 1 | 0.0196 | 0.0196 | 0.1178 | lower |
| F_DER | 15 | 3245 | 1 | 0.0308 | 0.0308 | 0.1849 | lower |
| F_BUT | 15 | 3245 | 1 | 0.0308 | 0.0308 | 0.1541 | lower |
| S_FRH | 13 | 5979 | 1 | 0.0167 | 0.0167 | 0.0669 | lower |
| F_BUT | 12 | 5620 | 1 | 0.0178 | 0.0178 | 0.1423 | lower |
| S_FRH | 12 | 5620 | 1 | 0.0178 | 0.0178 | 0.0890 | lower |
| F_COL | 11 | 3095 | 1 | 0.0323 | 0.0323 | 0.3554 | lower |
| F_MIL | 11 | 3095 | 1 | 0.0323 | 0.0323 | 0.2585 | lower |
| F_DER | 11 | 3095 | 1 | 0.0323 | 0.0323 | 0.1939 | lower |
| F_MER | 11 | 3095 | 1 | 0.0323 | 0.0323 | 0.2585 | lower |
| L_USR | 11 | 3095 | 1 | 0.0323 | 0.0323 | 0.2908 | lower |
| W_USR | 11 | 3095 | 1 | 0.0323 | 0.0323 | 0.2585 | lower |
| S_MIL | 11 | 3095 | 1 | 0.0323 | 0.0323 | 0.2908 | lower |
| F_STN | 10 | 5351 | 1 | 0.0187 | 0.0187 | 0.2056 | lower |
| F_MER | 10 | 5351 | 1 | 0.0187 | 0.0187 | 0.1869 | lower |
| F_DER | 7 | 5707 | 1 | 0.0175 | 0.0175 | 0.1752 | lower |
| F_MRH | 6 | 6408 | 1 | 0.0156 | 0.0156 | 0.1561 | lower |
| F_MIL | 5 | 6268 | 1 | 0.0160 | 0.0160 | 0.1755 | lower |
| F_TOU | 5 | 6268 | 1 | 0.0160 | 0.0160 | 0.1755 | lower |
| F_MRH | 5 | 6268 | 1 | 0.0160 | 0.0160 | 0.1755 | lower |
| S_FRH | 5 | 6268 | 1 | 0.0160 | 0.0160 | 0.0957 | lower |
| F_MKH | 4 | 5657 | 1 | 0.0177 | 0.0177 | 0.1944 | lower |
| S_FRH | 4 | 5657 | 1 | 0.0177 | 0.0177 | 0.0884 | lower |
| F_DER | 2 | 4869 | 1 | 0.0205 | 0.0205 | 0.1438 | lower |
| L_USR | 2 | 4869 | 1 | 0.0205 | 0.0205 | 0.2054 | lower |
| S_MIL | 2 | 4869 | 1 | 0.0205 | 0.0205 | 0.2054 | lower |
Table S20: Number of run/trib groups a chromosome has beenflagged as having significantly more/less loci with exclusivepolymorphisms than expected.
| CHROMOSOME | sign | n |
|---|---|---|
| 11 | lower | 7 |
| 32 | lower | 7 |
| 22 | lower | 6 |
| 31 | lower | 6 |
| 27 | lower | 5 |
| 29 | lower | 5 |
| 34 | lower | 5 |
| 5 | lower | 4 |
| 17 | lower | 4 |
| 20 | lower | 4 |
| 23 | lower | 4 |
| 28 | lower | 4 |
| 2 | lower | 3 |
| 21 | lower | 3 |
| 24 | lower | 3 |
| 25 | lower | 3 |
| 33 | lower | 3 |
| 4 | lower | 2 |
| 10 | lower | 2 |
| 12 | lower | 2 |
| 15 | lower | 2 |
| 16 | lower | 2 |
| 6 | lower | 1 |
| 7 | lower | 1 |
| 13 | lower | 1 |
| 18 | lower | 1 |
| 19 | lower | 1 |
| 30 | higher | 1 |
| 31 | higher | 1 |
Number of chromosomes with significantly more/less loci withexclusive polymorphisms than expected.
| GRP | sign | n |
|---|---|---|
| F_DER | lower | 11 |
| S_FRH | lower | 9 |
| F_MIL | lower | 8 |
| F_BUT | lower | 7 |
| F_MKH | lower | 7 |
| F_MER | lower | 7 |
| W_USR | lower | 6 |
| F_STN | lower | 5 |
| F_TOU | lower | 5 |
| L_USR | lower | 5 |
| S_BUT | lower | 5 |
| F_COL | lower | 4 |
| F_NIM | lower | 4 |
| F_MRH | lower | 4 |
| S_MIL | lower | 2 |
| F_DER | higher | 1 |
| F_BUT | higher | 1 |
| F_FRH | lower | 1 |
| S_DER | lower | 1 |
Comparison of private alleles among runs
For this analysis we will exclude hatchery individuals and combine wild individuals from each run and the subset to 21 individuals (number of individuals in Late Fall run data set) to identify the number of alleles that occur only in individuals of a given run.
# unset the seed to get individually different reps
set.seed(NULL)
# list for results
priv_allele_counts <- list()
# number of individuals to subset
n <- 21
# wild populations
wild <- strata %>%
filter(SOURCE == "WILD")
for(i in 1:100){
# draw up to n individuals from each population
sample_ind <- wild %>%
group_by(RUN) %>%
slice_sample(n = n)
# subset genind object
gen_subset <- gen[row.names(gen@tab) %in% sample_ind$LIB_ID]
# identify private alleles
alleles_private <- poppr::private_alleles(gen_subset,
alleles ~ RUN,
report = "data.frame")
# count no private alleles per population
priv_allele_counts[[i]] <- alleles_private %>%
filter(count > 0) %>%
count(population) %>%
mutate(rep = i)
}
results <- ldply(priv_allele_counts, data.frame)
write_delim(results, "results/priv-alleles_wild_ind21_reps100.txt",
delim = "\t")
Calculate the mean number of private alleles across all bootstraps.
Number of private alleles occurring in each set of individualsgrouped by run type within tributaries. All populations except for S_FRHwere subsampled down to 15 individuals.
| population | mean | sd |
|---|---|---|
| Fall | 861.22 | 33.61234 |
| Late-Fall | 807.24 | 21.58381 |
| Winter | 385.09 | 14.27090 |
| Spring | 1094.61 | 58.49543 |
Package versions used for this analysis (R 4.0.5).
## package loadedversion
## ade4 ade4 1.7-19
## adegenet adegenet 2.1.7
## ape ape 5.6-2
## assigner assigner 0.5.8
## coin coin 1.4-3
## dplyr dplyr 1.0.9
## forcats forcats 0.5.1
## ggplot2 ggplot2 3.3.6
## glue glue 1.6.2
## hierfstat hierfstat 0.5-11
## httr httr 1.4.3
## knitr knitr 1.39
## lattice lattice 0.20-45
## magrittr magrittr 2.0.3
## pegas pegas 1.1
## permute permute 0.9-7
## plyr plyr 1.8.7
## poppr poppr 2.9.3
## purrr purrr 0.3.4
## radiator radiator 1.2.2
## readr readr 2.1.2
## stringr stringr 1.4.0
## survival survival 3.3-1
## tibble tibble 3.1.7
## tidyr tidyr 1.2.0
## tidyverse tidyverse 1.3.1
## tint tint 0.1.3
## tufte tufte 0.12
## UpSetR UpSetR 1.4.0
## vcfR vcfR 1.15.0
## vegan vegan 2.6-2